Comprehensive test results and leaderboard rankings
19 benchmarks covering reasoning, coding, computer use, and real-world tasks
Models are ranked by their composite average across six core benchmarks (MMLU, HumanEval, GPQA, SWE-bench, HellaSwag, AIME 2025); a — marks a score that was not reported.
| Rank | Model | Developer | Overall | MMLU | GPQA | SWE-bench | Context |
|---|---|---|---|---|---|---|---|
| 1 | Claude Mythos (GATED) | Anthropic | — | — | — | — | 10T |
| 2 | GPT-5.4 | OpenAI | 91.5 | — | 92.8% | 81.2% | 1M |
| 3 | GPT-5.2 | OpenAI | 90.3 | — | 92.4% | 80.0% | 400K |
| 4 | Gemini 3.1 Pro | Google | 90.22 | — | 94.3% | 80.6% | 1M |
| 5 | Claude Opus 4.6 | Anthropic | 89.6 | — | 91.3% | 80.8% | 200K |
| 6 | GPT-5.3 Codex | OpenAI | 88.62 | — | 92.4% | 79.8% | 400K |
| 7 | GLM-5.1 | Zhipu AI | — | — | — | 81.5% | 200K |
| 8 | Claude Sonnet 4.6 | Anthropic | 88.4 | — | 87.0% | 78.0% | 1M (β) |
| 9 | Kimi K2.5 | Moonshot | — | 92.0 | 87.6% | — | 262K |
| 10 | DeepSeek V4 | DeepSeek | 87.9 | — | 86.8% | 78.5% | 128K |
| 11 | MiniMax M2.7 | MiniMax | 87.8 | — | 86.5% | 79.5% | 205K |
| 12 | Qwen 3.5 | Alibaba | — | — | 88.4% | 76.4% | 1M |
| 13 | MiMo-V2-Pro | Xiaomi | 87.2 | — | 85.8% | 77.8% | 1M |
| 14 | Mistral Small 4 | Mistral | 86.5 | — | 84.2% | 76.0% | 128K |
| 15 | Gemma 4 (27B) | Google | — | — | 82.5% | 72.0% | 128K |
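As a sketch of how a composite ranking like the one above can be derived, the snippet below averages each model's reported benchmark scores and sorts descending. The model names and scores are hypothetical placeholders, not the table's data, and the handling of missing results (skip them) is an assumption, not the leaderboard's documented methodology.

```python
from statistics import mean

# Hypothetical scores for illustration only (None = benchmark not reported).
scores = {
    "model-a": {"MMLU": 92.0, "GPQA": 92.8, "SWE-bench": 81.2},
    "model-b": {"MMLU": 90.5, "GPQA": 94.3, "SWE-bench": None},
    "model-c": {"MMLU": 89.0, "GPQA": 91.3, "SWE-bench": 80.8},
}

def composite(results):
    """Average over only the benchmarks that have a reported score."""
    reported = [v for v in results.values() if v is not None]
    return mean(reported)

# Rank models by composite average, highest first.
leaderboard = sorted(scores, key=lambda m: composite(scores[m]), reverse=True)
for rank, model in enumerate(leaderboard, start=1):
    print(rank, model, round(composite(scores[model]), 2))
```

Note that averaging only reported scores quietly favours models with missing results on their weaker benchmarks, which is one reason unbenchmarked entries are usually shown with — rather than a partial average.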
Model strengths at a glance:
- Autonomous computer use, browser automation, desktop workflows, long-context reasoning
- Agentic work, multi-step reasoning, large-context tasks
- Complex coding, deep reasoning, research tasks
- Near-Opus performance at 60% of the price, long context, agentic tasks
- Coding (beats GPT-5.4), self-hosting, MIT license, 200K context
- Defensive cybersecurity, vulnerability discovery, multi-hop reasoning (not publicly benchmarked)
- Coding, math, on-premise deployment (1T params, 32B active)
API pricing is quoted per 1 million tokens (USD). Consumer chat plans run roughly HK$148/mo for Gemini Advanced, HK$133/mo for Claude Pro, and HK$125/mo for ChatGPT Plus.
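Since API usage is billed pro-rata per 1 million tokens, a request's cost is straightforward to estimate. The rates below are made-up placeholders (real prices vary by provider, model, and tier), and `request_cost` is a hypothetical helper, not any vendor's SDK.

```python
# Hypothetical per-1M-token rates in USD; not actual vendor pricing.
RATES = {"example-model": {"input": 3.00, "output": 15.00}}

def request_cost(model, input_tokens, output_tokens):
    """Estimate USD cost of one request: tokens billed pro-rata per 1M."""
    r = RATES[model]
    return (input_tokens / 1_000_000) * r["input"] + \
           (output_tokens / 1_000_000) * r["output"]

# 200K input tokens + 50K output tokens -> roughly $1.35 at these rates.
print(request_cost("example-model", 200_000, 50_000))
```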