Performance Rankings

AI Benchmarks

Comprehensive test results and leaderboard rankings
19 benchmarks covering reasoning, coding, computer use, and real-world tasks


Benchmark Leaderboard

Overall ranking by composite benchmark average across MMLU, HumanEval, GPQA, SWE-bench, HellaSwag, and AIME 2025. The table shows a subset of the component scores; a sketch of the composite calculation follows it.

Rank  Model                  Developer  Overall  MMLU   GPQA   SWE-bench  Context
1     Claude Mythos (gated)  Anthropic  —        —      —      —          Unknown
2     GPT-5.4                OpenAI     91.5     —      92.8%  81.2%      1M
3     Gemini 3.1 Pro         Google     90.22    —      94.3%  80.6%      1M
4     Claude Opus 4.6        Anthropic  89.6     —      91.3%  80.8%      200K
5     GPT-5.2                OpenAI     90.3     —      92.4%  80.0%      400K
6     GPT-5.3 Codex          OpenAI     88.62    —      92.4%  79.8%      400K
7     GLM-5.1                Zhipu AI   —        —      —      81.5%*     200K
8     Claude Sonnet 4.6      Anthropic  88.4     88.4%  87.0%  78.0%      1M (β)
9     Kimi K2.5              Moonshot   —        92.0%  87.6%  —          262K
10    DeepSeek V4            DeepSeek   87.9     —      86.8%  78.5%      128K
11    MiniMax M2.7           MiniMax    87.8     —      86.5%  79.5%      205K
12    Qwen 3.5               Alibaba    —        —      88.4%  76.4%      1M
13    MiMo-V2-Pro            Xiaomi     87.2     —      85.8%  77.8%      1M
14    Mistral Small 4        Mistral    86.5     —      84.2%  76.0%      128K
15    Gemma 4 (27B)          Google     —        —      82.5%  72.0%      128K

— = score not reported. * SWE-bench Pro (see the GLM-5.1 spotlight). Claude Mythos (10T parameters) is gated and not publicly benchmarked.
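
The page does not state how the composite is aggregated; below is a minimal sketch, assuming a plain unweighted mean over whichever of the six component benchmarks a model reports. The function name and the example scores are illustrative placeholders, not published figures.

```python
# Minimal sketch of a composite benchmark average, assuming an unweighted
# mean over the six benchmarks named above. The real leaderboard may weight
# or normalize differently.
COMPOSITE_BENCHMARKS = ["MMLU", "HumanEval", "GPQA", "SWE-bench",
                        "HellaSwag", "AIME 2025"]

def composite_average(scores: dict[str, float]) -> float:
    """Average the 0-100 scores a model actually reports; skip missing ones."""
    reported = [scores[b] for b in COMPOSITE_BENCHMARKS if b in scores]
    if not reported:
        raise ValueError("no composite benchmarks reported")
    return sum(reported) / len(reported)

# Hypothetical example (not a real model's published scores):
print(round(composite_average({"MMLU": 92.4, "GPQA": 80.0,
                               "SWE-bench": 79.8}), 2))  # 84.07
```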

Key Benchmarks Explained

ARC-AGI-2: Novel reasoning puzzles designed so they can't be solved by memorization
GPQA Diamond: PhD-level science questions across physics, chemistry, and biology
SWE-bench: Resolution of real GitHub issues drawn from production codebases
HumanEval: Functional coding problems testing programming ability (scored as sketched after this list)
MMLU: Multi-task language understanding across 57 subjects
OSWorld: Autonomous computer use across desktop apps and browsers
GDPval: Real-world knowledge work across 44 professional occupations
Terminal-Bench: Autonomous DevOps tasks in terminal/shell environments
AIME 2025: High-school math competition problems
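
To make the HumanEval entry concrete, here is a minimal sketch of how a functional-coding benchmark scores a completion: the model's code is executed against the problem's unit tests, and a problem counts toward pass@1 only if every assertion passes. The task and tests below are illustrative stand-ins, not items from the real HumanEval set, and a production harness would sandbox the exec calls.

```python
# Minimal HumanEval-style scoring sketch: run a model completion against
# hidden unit tests; pass@1 is the fraction of problems whose tests all pass.
def score_completion(completion_src: str, test_src: str) -> bool:
    env: dict = {}
    try:
        exec(completion_src, env)  # define the candidate function
        exec(test_src, env)        # run the benchmark's assertions
        return True
    except Exception:              # any failure (error or bad assert) = no pass
        return False

task = "def add(a, b):\n    return a + b\n"   # hypothetical model output
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(score_completion(task, tests))  # True -> counts toward pass@1
```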

Model Spotlights

GPT-5.4

Computer Use Pioneer
OSWorld: 75.0% ★
GDPval: 83.0%
SWE-bench: 81.2%
GPQA Diamond: 92.8%

Best for: Autonomous computer use, browser automation, desktop workflows, long-context reasoning

Gemini 3.1 Pro

Benchmark Leader
ARC-AGI-2: 77.1%
GPQA Diamond: 94.3%
SWE-bench: 80.6%
Terminal-Bench: 68.5%

Best for: Agentic work, multi-step reasoning, large-context tasks

Claude Opus 4.6

Coding King
SWE-bench: 80.8%
GPQA Diamond: 91.3%
ARC-AGI-2: 68.8%
Humanity's Last Exam: 53.1%

Best for: Complex coding, deep reasoning, research tasks

Claude Sonnet 4.6

Best Value
SWE-bench: 78.0%
GPQA Diamond: 87.0%
MMLU: 88.4%
Context window: 1M (β)

Best for: Long-context work and agentic tasks, with near-Opus performance at roughly 60% of the price

GLM-5.1

Open Source Champion
SWE-bench Pro: 81.5% ★
SWE-bench Verified: 79.0%
HumanEval: 96.0%
Parameters: 744B MoE

Best for: Coding (beats GPT-5.4 on SWE-bench Pro), self-hosting under an MIT license, 200K context

Claude Mythos

Gated Access
Parameters: 10 trillion
Cyber capability: state-of-the-art
Context window: unknown
Access: 50 organizations

Best for: Defensive cybersecurity, vulnerability discovery, multi-hop reasoning. Not publicly benchmarked.

Kimi K2.5

Open Weight
HumanEval: 99.0%
MMLU: 92.0%
MATH-500: 98.0%
AIME 2025: 96.1%

Best for: Coding, math, and on-premises deployment (1T total parameters, 32B active)

API Pricing Comparison

Per 1 million tokens, USD (input price / output price)

Value Tier

GPT-5.4 Nano $0.20 / $0.80
Qwen 3.5 $0.40 / $1.20
Gemini 2.5 Flash Lite $0.10 / $0.40
Gemini 2.5 Flash $0.30 / $2.50

Premium Tier

Claude Mythos $25.00 / $125.00
Claude Opus 4.6 $5.00 / $25.00
GPT-5.2 $1.25 / $10.00

Fast Tier

GPT-5.4 Mini $0.75 / $4.50
2× faster than the previous Mini; 400K context

Consumer chat plans: Gemini Advanced ~HK$148/mo, Claude Pro ~HK$133/mo, ChatGPT Plus ~HK$125/mo
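
As a worked example of the input/output pricing convention above, here is a minimal sketch that estimates a single request's cost. The token counts are hypothetical; the prices are taken from the Premium Tier entry for Claude Opus 4.6.

```python
# Minimal sketch: cost of one request given per-1M-token input/output prices.
def request_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost in USD; prices are per 1 million tokens (input / output)."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# e.g. Claude Opus 4.6 at $5.00 / $25.00 per 1M tokens,
# with a hypothetical 20K-token prompt and 2K-token reply:
print(f"${request_cost(20_000, 2_000, 5.00, 25.00):.3f}")  # $0.150
```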