Best for coding agents
Primary choice: GPT‑5.3 Codex. Alternative: GPT‑5.4 when broader office and computer-use capability also matters.
This report compares five frontier models using official documentation, vendor-reported benchmark results, and a normalized cross-check of overlapping model metadata. It is designed to help with practical model selection for coding agents, long-context reasoning, computer use, multimodal work, and professional knowledge workflows.
Short operational profiles focused on fit, strengths, and caveats rather than marketing language.
Filter, sort, and compare overlapping benchmark rows. Charts are rendered with Chart.js; the table is generated from the same underlying dataset to keep the page internally consistent.
Choose a benchmark from the filter panel to compare all models on a single row.
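As a rough sketch of that shared-dataset pattern (illustrative names and row shape, not this page's actual source), a single array of benchmark rows can drive both the Chart.js chart and the generated table, so the two views cannot drift apart:

```ts
// Minimal sketch: one row-oriented dataset feeds both a Chart.js bar chart
// and an HTML table. Names and values are illustrative.
import Chart from 'chart.js/auto';

type BenchmarkRow = {
  benchmark: string;
  // null stands for "n/a": the model has no reported result on this benchmark.
  scores: Record<string, number | null>;
};

const MODELS = ['GPT-5.3 Codex', 'GPT-5.4', 'Claude Sonnet 4.6', 'Claude Opus 4.6', 'Gemini 3 Pro'];

const rows: BenchmarkRow[] = [
  {
    benchmark: 'Terminal-Bench 2.0',
    scores: { 'GPT-5.3 Codex': 77.3, 'GPT-5.4': 75.1, 'Claude Sonnet 4.6': 59.1, 'Claude Opus 4.6': 65.4, 'Gemini 3 Pro': 54.2 },
  },
  // ...remaining rows from the table below
];

// Chart view: one bar per model for the selected benchmark; nulls render as gaps.
function renderChart(canvas: HTMLCanvasElement, row: BenchmarkRow): Chart {
  return new Chart(canvas, {
    type: 'bar',
    data: {
      labels: MODELS,
      datasets: [{ label: row.benchmark, data: MODELS.map((m) => row.scores[m]) }],
    },
    options: { scales: { y: { min: 0, max: 100 } } },
  });
}

// Table view: generated from the same `rows` array, rendering null as "n/a".
function renderTable(table: HTMLTableElement, data: BenchmarkRow[]): void {
  table.innerHTML = '';
  for (const row of data) {
    const tr = table.insertRow();
    tr.insertCell().textContent = row.benchmark;
    for (const model of MODELS) {
      const score = row.scores[model];
      tr.insertCell().textContent = score === null ? 'n/a' : `${score}%`;
    }
  }
}
```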
A simple category-level summary based only on the rows included in this report. Missing data is ignored rather than imputed.
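A minimal sketch of that aggregation rule, assuming the same nullable row shape used in the dataset sketch above (category names and field names are illustrative, not the report's actual code):

```ts
// Per-category averages are computed only over the scores a model actually has,
// so an "n/a" cell (null) neither counts against the model nor gets imputed.
type CategoryRow = {
  category: string;                       // e.g. 'agentic coding', 'reasoning'
  scores: Record<string, number | null>;  // model name -> score, or null for n/a
};

function categoryAverages(
  rowsInCategory: CategoryRow[],
  models: string[],
): Record<string, number | null> {
  const summary: Record<string, number | null> = {};
  for (const model of models) {
    const present = rowsInCategory
      .map((row) => row.scores[model])
      .filter((s): s is number => s !== null && s !== undefined);
    // A model with no reported rows in this category stays null rather than 0.
    summary[model] = present.length > 0
      ? present.reduce((a, b) => a + b, 0) / present.length
      : null;
  }
  return summary;
}
```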
| Benchmark | GPT‑5.3 Codex | GPT‑5.4 | Claude Sonnet 4.6 | Claude Opus 4.6 | Gemini 3 Pro |
|---|---|---|---|---|---|
| Terminal-Bench 2.0 | 77.3% (OpenAI self-reported, xhigh effort) | 75.1% (OpenAI self-reported, xhigh effort) | 59.1% (Anthropic-reported agentic terminal coding) | 65.4% (Anthropic Terminus-2 harness, 5–15 samples per task) | 54.2% (Google-reported via Terminus-2 agent) |
| SWE-Bench Pro (Public) | 56.8% (OpenAI self-reported, xhigh effort) | 57.7% (OpenAI self-reported, xhigh effort) | n/a | n/a | 54.2% (Google-reported, single attempt) |
| SWE-Bench Verified | n/a | n/a | 79.6% (Anthropic-reported agentic coding result) | 80.8% (Anthropic average over 25 trials) | 76.2% (Google-reported, single attempt) |
| OSWorld / OSWorld-Verified | 64.7% (OpenAI self-reported, OSWorld-Verified) | 75.0% (OpenAI self-reported, OSWorld-Verified) | 72.5% (Anthropic-reported OSWorld-Verified-style result) | 72.7% (Anthropic-reported computer-use evaluation) | n/a |
| BrowseComp | n/a | 82.7% (OpenAI self-reported, xhigh effort) | 74.7% (Anthropic agentic search result) | 84.0% (Anthropic run with search, fetch, tool calling, and compaction) | n/a |
| MCP Atlas | n/a | 67.2% (OpenAI self-reported, xhigh effort) | 61.3% (Anthropic scaled tool-use result) | 59.5% (Anthropic max-effort run; announcement notes report a higher high-effort score than this listed baseline) | n/a |
| GPQA Diamond | n/a | 92.8% (OpenAI self-reported, xhigh effort) | 89.9% (Anthropic-reported) | 91.3% (Anthropic-reported) | 91.9% (Google-reported, no tools) |
| Humanity's Last Exam | n/a | 39.8% (OpenAI no-tools score from the official announcement excerpt reviewed) | 49.0% (Anthropic with-tools score; 33.2% without tools also cited) | 53.1% (Anthropic with-tools score using search, fetch, code execution, and compaction) | 45.8% (Google-reported, search + code setup) |
| MMMU-Pro | n/a | 81.2% (OpenAI self-reported, no tools) | 75.6% (Anthropic with-tools score; 74.5% without tools also cited) | 77.3% (Anthropic with-tools score; 73.9% without tools also cited) | 81.0% (Google-reported multimodal score) |
| ARC-AGI-2 | n/a | 73.3% (OpenAI self-reported ARC-AGI v2 verified result) | 58.3% (Anthropic-reported) | 68.8% (Anthropic-reported max-effort result) | 31.1% (Google ARC Prize Verified result) |
These are pragmatic starting points, not categorical prescriptions.
Primary choice: GPT‑5.3 Codex. Alternative: GPT‑5.4 when broader office and computer-use capability also matters.
Primary choice: GPT‑5.4. It offers the cleanest all-round capability mix in the OpenAI family covered here.
Primary choice: Claude Sonnet 4.6. It provides the most attractive balance between cost, quality, and practical versatility.
Primary choice: Claude Opus 4.6. Use it when failure costs are higher than model costs.
Primary choice: Gemini 3 Pro. Especially relevant for mixed media, large context, and broad reasoning-heavy workloads.
Primary choice: Claude Sonnet 4.6. Step up to Opus 4.6 only when task difficulty clearly justifies the premium.
Short descriptions of what each evaluation is actually trying to measure.
Primary sources were preferred. Aggregated metadata was used cautiously to normalize overlapping rows and reduce repetitive extraction work.
The aggregator was used as a normalization aid, not as an unquestioned authority. Where an official source and an aggregated summary diverged, the official source or an explicit caveat was preferred.
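As a hedged illustration of that reconciliation step (field names, the tolerance, and the handling of rows with no primary source are assumptions for this sketch, not the report's actual tooling):

```ts
// Illustrative sketch: aggregated metadata only confirms an official number.
// When the two diverge beyond a small tolerance, the official value is kept
// and the row is flagged so an explicit caveat can be attached.
type SourcedScore = { official: number | null; aggregated: number | null };

function reconcile(entry: SourcedScore, tolerancePts = 0.1): { value: number | null; caveat: boolean } {
  if (entry.official === null) {
    // No primary source available: shown as n/a here rather than falling back
    // to the aggregator alone (an assumption made for this illustration).
    return { value: null, caveat: false };
  }
  const diverges =
    entry.aggregated !== null && Math.abs(entry.official - entry.aggregated) > tolerancePts;
  return { value: entry.official, caveat: diverges };
}
```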