Performance Rankings

AI Benchmarks

Comprehensive test results and leaderboard rankings
19 benchmarks covering reasoning, coding, computer use, and real-world tasks


Benchmark Leaderboard

Overall ranking by composite benchmark average across MMLU, HumanEval, GPQA, SWE-bench, HellaSwag, and AIME 2025. The table shows a subset of the component scores; a sketch of the composite calculation follows it.

Rank  Model                  Developer  Overall  MMLU   GPQA   SWE-bench  Context
1     Claude Mythos (gated)  Anthropic  —        —      —      —          Unknown
2     GPT-5.4                OpenAI     91.5     —      92.8%  81.2%      1M
3     Gemini 3.1 Pro         Google     90.22    —      94.3%  80.6%      1M
4     Claude Opus 4.6        Anthropic  89.6     —      91.3%  80.8%      200K
5     GPT-5.2                OpenAI     90.3     —      92.4%  80.0%      400K
6     GPT-5.3 Codex          OpenAI     88.62    —      92.4%  79.8%      400K
7     GLM-5.1                Zhipu AI   —        —      —      81.5%*     200K
8     Claude Sonnet 4.6      Anthropic  88.4     88.4%  87.0%  78.0%      1M (β)
9     Kimi K2.5              Moonshot   —        92.0%  87.6%  —          262K
10    DeepSeek V4            DeepSeek   87.9     —      86.8%  78.5%      128K
11    MiniMax M2.7           MiniMax    87.8     —      86.5%  79.5%      205K
12    Qwen 3.5               Alibaba    —        —      88.4%  76.4%      1M
13    MiMo-V2-Pro            Xiaomi     87.2     —      85.8%  77.8%      1M
14    Mistral Small 4        Mistral    86.5     —      84.2%  76.0%      128K
15    Gemma 4 (27B)          Google     —        —      82.5%  72.0%      128K

— = score not reported. * SWE-bench Pro (see the GLM-5.1 spotlight). Claude Mythos (10T parameters) is gated and not publicly benchmarked.
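
The page does not state how the composite is aggregated; below is a minimal sketch, assuming a plain unweighted mean over whichever of the six component benchmarks a model reports. The function name and the example scores are illustrative placeholders, not published figures.

```python
# Minimal sketch of a composite benchmark average, assuming an unweighted
# mean over the six benchmarks named above. The real leaderboard may weight
# or normalize differently.
COMPOSITE_BENCHMARKS = ["MMLU", "HumanEval", "GPQA", "SWE-bench",
                        "HellaSwag", "AIME 2025"]

def composite_average(scores: dict[str, float]) -> float:
    """Average the 0-100 scores a model actually reports; skip missing ones."""
    reported = [scores[b] for b in COMPOSITE_BENCHMARKS if b in scores]
    if not reported:
        raise ValueError("no composite benchmarks reported")
    return sum(reported) / len(reported)

# Hypothetical example (not a real model's published scores):
print(round(composite_average({"MMLU": 92.4, "GPQA": 80.0,
                               "SWE-bench": 79.8}), 2))  # 84.07
```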

Key Benchmarks Explained

ARC-AGI-2: Novel reasoning puzzles designed so they can't be solved by memorization
GPQA Diamond: PhD-level science questions across physics, chemistry, and biology
SWE-bench: Resolution of real GitHub issues drawn from production codebases
HumanEval: Functional coding problems testing programming ability (scored as sketched after this list)
MMLU: Multi-task language understanding across 57 subjects
OSWorld: Autonomous computer use across desktop apps and browsers
GDPval: Real-world knowledge work across 44 professional occupations
Terminal-Bench: Autonomous DevOps tasks in terminal/shell environments
AIME 2025: High-school math competition problems
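
To make the HumanEval entry concrete, here is a minimal sketch of how a functional-coding benchmark scores a completion: the model's code is executed against the problem's unit tests, and a problem counts toward pass@1 only if every assertion passes. The task and tests below are illustrative stand-ins, not items from the real HumanEval set, and a production harness would sandbox the exec calls.

```python
# Minimal HumanEval-style scoring sketch: run a model completion against
# hidden unit tests; pass@1 is the fraction of problems whose tests all pass.
def score_completion(completion_src: str, test_src: str) -> bool:
    env: dict = {}
    try:
        exec(completion_src, env)  # define the candidate function
        exec(test_src, env)        # run the benchmark's assertions
        return True
    except Exception:              # any failure (error or bad assert) = no pass
        return False

task = "def add(a, b):\n    return a + b\n"   # hypothetical model output
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(score_completion(task, tests))  # True -> counts toward pass@1
```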

Model Spotlights

GPT-5.4

Computer Use Pioneer
OSWorld: 75.0% ★
GDPval: 83.0%
SWE-bench: 81.2%
GPQA Diamond: 92.8%

Best for: Autonomous computer use, browser automation, desktop workflows, long-context reasoning

Gemini 3.1 Pro

Benchmark Leader
ARC-AGI-2: 77.1%
GPQA Diamond: 94.3%
SWE-bench: 80.6%
Terminal-Bench: 68.5%

Best for: Agentic work, multi-step reasoning, large-context tasks

Claude Opus 4.6

Coding King
SWE-bench: 80.8%
GPQA Diamond: 91.3%
ARC-AGI-2: 68.8%
Humanity's Last Exam: 53.1%

Best for: Complex coding, deep reasoning, research tasks

Claude Sonnet 4.6

Best Value
SWE-bench: 78.0%
GPQA Diamond: 87.0%
MMLU: 88.4%
Context window: 1M (β)

Best for: Long-context work and agentic tasks, with near-Opus performance at roughly 60% of the price

GLM-5.1

Open Source Champion
SWE-bench Pro: 81.5% ★
SWE-bench Verified: 79.0%
HumanEval: 96.0%
Parameters: 744B MoE

Best for: Coding (beats GPT-5.4 on SWE-bench Pro), self-hosting under an MIT license, 200K context

Claude Mythos

Gated Access
Parameters: 10 trillion
Cyber capability: state-of-the-art
Context window: unknown
Access: 50 organizations

Best for: Defensive cybersecurity, vulnerability discovery, multi-hop reasoning. Not publicly benchmarked.

Kimi K2.5

Open Weight
HumanEval: 99.0%
MMLU: 92.0%
MATH-500: 98.0%
AIME 2025: 96.1%

Best for: Coding, math, and on-premises deployment (1T total parameters, 32B active)

API Pricing Comparison

Per 1 million tokens, USD (input price / output price)

Value Tier

GPT-5.4 Nano $0.20 / $0.80
Qwen 3.5 $0.40 / $1.20
Gemini 2.5 Flash Lite $0.10 / $0.40
Gemini 2.5 Flash $0.30 / $2.50

Premium Tier

Claude Mythos $25.00 / $125.00
Claude Opus 4.6 $5.00 / $25.00
GPT-5.2 $1.25 / $10.00

Fast Tier

GPT-5.4 Mini $0.75 / $4.50
2× faster than the previous Mini; 400K context

Consumer chat plans: Gemini Advanced ~HK$148/mo, Claude Pro ~HK$133/mo, ChatGPT Plus ~HK$125/mo
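
As a worked example of the input/output pricing convention above, here is a minimal sketch that estimates a single request's cost. The token counts are hypothetical; the prices are taken from the Premium Tier entry for Claude Opus 4.6.

```python
# Minimal sketch: cost of one request given per-1M-token input/output prices.
def request_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost in USD; prices are per 1 million tokens (input / output)."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# e.g. Claude Opus 4.6 at $5.00 / $25.00 per 1M tokens,
# with a hypothetical 20K-token prompt and 2K-token reply:
print(f"${request_cost(20_000, 2_000, 5.00, 25.00):.3f}")  # $0.150
```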