Comprehensive test results and leaderboard rankings
16 benchmarks covering reasoning, coding, knowledge, and more
Overall ranking is by composite benchmark average across six tests (MMLU, HumanEval, GPQA, SWE-bench, HellaSwag, AIME 2025); a sketch of the calculation follows the table.
| Rank | Model | Developer | Overall | MMLU | GPQA | SWE-bench | Context window |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.2 | OpenAI | 90.3 | — | 92.4% | 80.0% | 400K |
| 2 | Gemini 3.1 Pro | Google | 90.22 | — | 94.3% | 80.6% | 1M |
| 3 | Claude Opus 4.6 | Anthropic | 89.6 | — | 91.3% | 80.8% | 200K |
| 4 | Claude Opus 4.5 | Anthropic | 88.82 | — | — | 80.9% | 200K |
| 5 | GPT-5.3 Codex | OpenAI | 88.62 | — | 92.4% | — | 400K |
| 6 | Kimi K2.5 | Moonshot | — | 92.0% | 87.6% | — | 262K |
| 7 | MiniMax M2.5 | MiniMax | — | — | 85.2% | 80.2% | 205K |
| 8 | Qwen 3.5 | Alibaba | — | — | 88.4% | 76.4% | 1M |
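To make the ranking methodology concrete, here is a minimal sketch of how such a composite average could be computed. It assumes equal weighting of the six benchmarks and that missing scores are simply excluded from a model's average; the article does not state how gaps are handled, and the example scores are hypothetical, not figures from the table above.

```python
from statistics import mean

# The six benchmarks that feed the composite, per the ranking note above.
BENCHMARKS = ["MMLU", "HumanEval", "GPQA", "SWE-bench", "HellaSwag", "AIME 2025"]

def composite_average(scores: dict[str, float]) -> float | None:
    """Equal-weight mean over whichever of the six benchmarks a model reports.

    `scores` maps benchmark name -> score in percent. Missing benchmarks
    are skipped rather than imputed (an assumption; the ranking note does
    not say how gaps are handled). Returns None if no scores are present.
    """
    present = [scores[b] for b in BENCHMARKS if b in scores]
    return mean(present) if present else None

# Hypothetical scores for illustration -- not figures from the table above.
example = {"MMLU": 92.0, "GPQA": 87.6, "SWE-bench": 74.5}
print(f"{composite_average(example):.2f}")  # 84.70
```

Equal weighting keeps the composite transparent, but it also means a model missing a hard benchmark (e.g., AIME 2025) is averaged over fewer, possibly easier tests, which is one reason several rows above show "—" rather than an overall score.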
Best-for highlights from the rankings:
- Agentic work, multi-step reasoning, and large-context tasks
- Complex coding, deep reasoning, and research tasks
- Coding, math, and on-premise deployment (1T parameters, 32B active)
API pricing is quoted per 1 million tokens (USD).
Consumer chat plans: Gemini Advanced ~HK$148/mo, Claude Pro ~HK$133/mo, ChatGPT Plus ~HK$125/mo.
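Because API pricing is linear in token count, per-request cost is a simple scaling of the per-1M-token rates. A minimal sketch follows; the rates in the example are placeholders for illustration, not any provider's actual prices.

```python
def api_cost_usd(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one request, given per-1M-token input/output prices in USD."""
    return (input_tokens / 1_000_000) * in_price_per_m + \
           (output_tokens / 1_000_000) * out_price_per_m

# Placeholder rates for illustration only -- substitute a provider's real
# per-1M-token prices; none of these numbers come from the article.
print(f"${api_cost_usd(120_000, 8_000, 3.00, 15.00):.4f}")  # $0.4800
```

Note that input and output tokens are usually priced differently, so estimates based on a single blended rate can be off by several times for output-heavy workloads.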