Performance Rankings

AI Benchmarks

Comprehensive test results and leaderboard rankings across 16 benchmarks covering reasoning, coding, knowledge, and more


Benchmark Leaderboard

Overall ranking by composite benchmark average across MMLU, HumanEval, GPQA, SWE-bench, HellaSwag, and AIME 2025; a sketch of the averaging appears below the table. Cells marked – are scores not reported here.

Rank | Model | Developer | Overall | MMLU | GPQA | SWE-bench | Context
1 | GPT-5.2 | OpenAI | 90.3 | – | 92.4% | 80.0% | 400K
2 | Gemini 3.1 Pro | Google | 90.22 | – | 94.3% | 80.6% | 1M
3 | Claude Opus 4.6 | Anthropic | 89.6 | – | 91.3% | 80.8% | 200K
4 | Claude Opus 4.5 | Anthropic | 88.82 | – | – | 80.9% | 200K
5 | GPT-5.3 Codex | OpenAI | 88.62 | – | 92.4% | – | 400K
6 | Kimi K2.5 | Moonshot | – | 92.0 | 87.6% | – | 262K
7 | MiniMax M2.5 | MiniMax | – | – | 85.2% | 80.2% | 205K
8 | Qwen 3.5 | Alibaba | – | – | 88.4% | 76.4% | 1M
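
The "Overall" column is described above as a composite benchmark average. As a minimal sketch of that idea, the snippet below computes an unweighted mean over whichever per-benchmark scores are available for a model; the leaderboard's actual weighting is not specified here, and the scores in the example are placeholders rather than values from the table.

```python
# Minimal sketch: composite "Overall" score as an unweighted mean of the
# per-benchmark scores available for a model. Assumption: the leaderboard's
# exact weighting is not specified here; the numbers below are placeholders.
from statistics import mean

def overall_score(scores: dict[str, float]) -> float:
    """Average the per-benchmark scores (0-100 scale) that are present."""
    return round(mean(scores.values()), 2)

example = {
    "MMLU": 92.4,        # placeholder values, not taken from the leaderboard
    "HumanEval": 95.0,
    "GPQA": 80.0,
    "SWE-bench": 78.0,
    "HellaSwag": 96.0,
    "AIME 2025": 90.0,
}
print(overall_score(example))  # -> 88.57
```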

Key Benchmarks Explained

ARC-AGI-2: Novel reasoning puzzles designed so they cannot be solved by memorization
GPQA Diamond: PhD-level science questions across physics, chemistry, and biology
SWE-bench: Resolution of real GitHub issues drawn from production codebases
HumanEval: Python function-writing problems checked against unit tests (see the sketch after this list)
MMLU: Multi-task language understanding across 57 subjects
AIME 2025: Problems from the 2025 American Invitational Mathematics Examination, a high-school math competition
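
To make "functional coding problems" concrete, here is a hedged, HumanEval-style illustration: the model is given a function signature plus docstring and must produce a working body, which is then checked by running unit tests. The task and tests below are invented for illustration and are not actual benchmark items.

```python
# Invented HumanEval-style task (not an actual benchmark item): complete the
# function from its signature and docstring, then verify it with unit tests.

def running_max(xs: list[int]) -> list[int]:
    """Return a list where element i is the maximum of xs[:i+1]."""
    result: list[int] = []
    current: int | None = None
    for x in xs:
        current = x if current is None or x > current else current
        result.append(current)
    return result

# Unit tests stand in for the benchmark's hidden test suite.
assert running_max([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_max([-2, -7, -1]) == [-2, -2, -1]
assert running_max([]) == []
```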

Model Spotlights

Gemini 3.1 Pro

Benchmark Leader
ARC-AGI-2: 77.1%
GPQA Diamond: 94.3%
SWE-bench: 80.6%
Terminal-Bench: 68.5%

Best for: Agentic work, multi-step reasoning, large-context tasks

Claude Opus 4.6

Coding King
SWE-bench: 80.8%
GPQA Diamond: 91.3%
ARC-AGI-2: 68.8%
Humanity's Last Exam: 53.1%

Best for: Complex coding, deep reasoning, research tasks

Kimi K2.5

Open Weight
HumanEval: 99.0%
MMLU: 92.0
MATH-500: 98.0%
AIME 2025: 96.1%

Best for: Coding, math, on-premise deployment (1T params, 32B active)
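
The "1T params, 32B active" note refers to a sparse mixture-of-experts design: each token is routed to only a few experts, so only a slice of the total weights participates in any forward pass. The sketch below uses made-up layer sizes chosen only to land near those headline numbers; it is not Kimi K2.5's actual configuration.

```python
# Illustrative sketch of mixture-of-experts parameter accounting (made-up
# sizes, not Kimi K2.5's real configuration): "total" counts every expert,
# while "active" counts only the experts a token is routed to plus the
# shared (attention/embedding) weights.

def moe_params(num_experts: int, params_per_expert: int,
               shared_params: int, experts_per_token: int) -> tuple[int, int]:
    """Return (total, active-per-token) parameter counts."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active

# 96 experts of 10B parameters each, 12B shared weights, top-2 routing:
total, active = moe_params(96, 10_000_000_000, 12_000_000_000, 2)
print(f"total ≈ {total / 1e12:.2f}T, active ≈ {active / 1e9:.0f}B")  # ≈ 0.97T / 32B
```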

API Pricing Comparison

Per 1 million tokens, input / output (USD); a cost-calculation sketch follows the tier lists

Value Tier

Qwen 3.5 $0.40 / $1.20
Gemini 2.5 Flash Lite $0.10 / $0.40
Gemini 2.5 Flash $0.30 / $2.50

Premium Tier

Claude Opus 4.6 $5.00 / $25.00
GPT-5.2 Premium
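
To show how per-million-token rates translate into the cost of a single request, here is a minimal sketch using the Claude Opus 4.6 rates listed above ($5.00 input / $25.00 output per 1M tokens); the token counts are illustrative, not drawn from any benchmark run.

```python
# Sketch: cost of one API call given rates quoted per 1 million tokens.
# Rates below are the Claude Opus 4.6 figures from the Premium Tier list;
# the token counts are made-up illustrative values.

def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost in USD, with input_rate/output_rate quoted per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 20K-token prompt with a 2K-token completion at $5 / $25 per 1M tokens:
print(f"${request_cost(20_000, 2_000, 5.00, 25.00):.2f}")  # -> $0.15
```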

Consumer chat plans: Gemini Advanced ~HK$148/mo, Claude Pro ~HK$133/mo, ChatGPT Plus ~HK$125/mo