Best for coding agents
Primary choice: GPT‑5.3 Codex. Alternative: GPT‑5.4 when broader office and computer-use capability also matters.
This report compares five frontier models using official documentation, vendor-reported benchmark results, and a normalized cross-check of overlapping model metadata. It is designed to help with practical model selection for coding agents, long-context reasoning, computer use, multimodal work, and professional knowledge workflows.
Short operational profiles focused on fit, strengths, and caveats rather than marketing language.
Filter, sort, and compare overlapping benchmark rows. Charts are rendered with Chart.js; the table is generated from the same underlying dataset to keep the page internally consistent.
Choose a benchmark from the filter panel to compare all models on a single row.
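As a rough sketch of that shared-dataset pattern (illustrative names and row shape, not this page's actual source), a single array of benchmark rows can drive both the Chart.js chart and the generated table, so the two views cannot drift apart:

```ts
// Minimal sketch: one row-oriented dataset feeds both a Chart.js bar chart
// and an HTML table. Names and values are illustrative.
import Chart from 'chart.js/auto';

type BenchmarkRow = {
  benchmark: string;
  // null stands for "n/a": the model has no reported result on this benchmark.
  scores: Record<string, number | null>;
};

const MODELS = ['GPT-5.3 Codex', 'GPT-5.4', 'Claude Sonnet 4.6', 'Claude Opus 4.6', 'Gemini 3 Pro'];

const rows: BenchmarkRow[] = [
  {
    benchmark: 'Terminal-Bench 2.0',
    scores: { 'GPT-5.3 Codex': 77.3, 'GPT-5.4': 75.1, 'Claude Sonnet 4.6': 59.1, 'Claude Opus 4.6': 65.4, 'Gemini 3 Pro': 54.2 },
  },
  // ...remaining rows from the table below
];

// Chart view: one bar per model for the selected benchmark; nulls render as gaps.
function renderChart(canvas: HTMLCanvasElement, row: BenchmarkRow): Chart {
  return new Chart(canvas, {
    type: 'bar',
    data: {
      labels: MODELS,
      datasets: [{ label: row.benchmark, data: MODELS.map((m) => row.scores[m]) }],
    },
    options: { scales: { y: { min: 0, max: 100 } } },
  });
}

// Table view: generated from the same `rows` array, rendering null as "n/a".
function renderTable(table: HTMLTableElement, data: BenchmarkRow[]): void {
  table.innerHTML = '';
  for (const row of data) {
    const tr = table.insertRow();
    tr.insertCell().textContent = row.benchmark;
    for (const model of MODELS) {
      const score = row.scores[model];
      tr.insertCell().textContent = score === null ? 'n/a' : `${score}%`;
    }
  }
}
```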
A simple category-level summary based only on the rows included in this report. Missing data is ignored rather than imputed.
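A minimal sketch of that aggregation rule, assuming the same nullable row shape used in the dataset sketch above (category names and field names are illustrative, not the report's actual code):

```ts
// Per-category averages are computed only over the scores a model actually has,
// so an "n/a" cell (null) neither counts against the model nor gets imputed.
type CategoryRow = {
  category: string;                       // e.g. 'agentic coding', 'reasoning'
  scores: Record<string, number | null>;  // model name -> score, or null for n/a
};

function categoryAverages(
  rowsInCategory: CategoryRow[],
  models: string[],
): Record<string, number | null> {
  const summary: Record<string, number | null> = {};
  for (const model of models) {
    const present = rowsInCategory
      .map((row) => row.scores[model])
      .filter((s): s is number => s !== null && s !== undefined);
    // A model with no reported rows in this category stays null rather than 0.
    summary[model] = present.length > 0
      ? present.reduce((a, b) => a + b, 0) / present.length
      : null;
  }
  return summary;
}
```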
| Benchmark | GPT‑5.3 Codex | GPT‑5.4 | Claude Sonnet 4.6 | Claude Opus 4.6 | Gemini 3 Pro |
|---|---|---|---|---|---|
| Terminal-Bench 2.0 | 77.3% (OpenAI self-reported, xhigh effort) | 75.1% (OpenAI self-reported, xhigh effort) | 59.1% (Anthropic-reported agentic terminal coding) | 65.4% (Anthropic Terminus-2 harness, 5–15 samples per task) | 54.2% (Google-reported via Terminus-2 agent) |
| SWE-Bench Pro (Public) | 56.8% (OpenAI self-reported, xhigh effort) | 57.7% (OpenAI self-reported, xhigh effort) | n/a | n/a | 54.2% (Google-reported, single attempt) |
| SWE-Bench Verified | n/a | n/a | 79.6% (Anthropic-reported agentic coding result) | 80.8% (Anthropic average over 25 trials) | 76.2% (Google-reported, single attempt) |
| OSWorld / OSWorld-Verified | 64.7% (OpenAI self-reported, OSWorld-Verified) | 75.0% (OpenAI self-reported, OSWorld-Verified) | 72.5% (Anthropic-reported OSWorld-Verified-style result) | 72.7% (Anthropic-reported computer-use evaluation) | n/a |
| BrowseComp | n/a | 82.7% (OpenAI self-reported, xhigh effort) | 74.7% (Anthropic agentic search result) | 84.0% (Anthropic run with search, fetch, tool calling, and compaction) | n/a |
| MCP Atlas | n/a | 67.2% (OpenAI self-reported, xhigh effort) | 61.3% (Anthropic scaled tool-use result) | 59.5% (Anthropic max-effort run; announcement notes report a higher high-effort score than this listed baseline) | n/a |
| GPQA Diamond | n/a | 92.8% (OpenAI self-reported, xhigh effort) | 89.9% (Anthropic-reported) | 91.3% (Anthropic-reported) | 91.9% (Google-reported, no tools) |
| Humanity's Last Exam | n/a | 39.8% (OpenAI no-tools score from the official announcement excerpt reviewed) | 49.0% (Anthropic with-tools score; 33.2% without tools also cited) | 53.1% (Anthropic with-tools score using search, fetch, code execution, and compaction) | 45.8% (Google-reported, search + code setup) |
| MMMU-Pro | n/a | 81.2% (OpenAI self-reported, no tools) | 75.6% (Anthropic with-tools score; 74.5% without tools also cited) | 77.3% (Anthropic with-tools score; 73.9% without tools also cited) | 81.0% (Google-reported multimodal score) |
| ARC-AGI-2 | n/a | 73.3% (OpenAI self-reported ARC-AGI v2 verified result) | 58.3% (Anthropic-reported) | 68.8% (Anthropic-reported max-effort result) | 31.1% (Google ARC Prize Verified result) |
These are pragmatic starting points, not categorical prescriptions.
Primary choice: GPT‑5.3 Codex. Alternative: GPT‑5.4 when broader office and computer-use capability also matters.
Primary choice: GPT‑5.4. It offers the cleanest all-round capability mix in the OpenAI family covered here.
Primary choice: Claude Sonnet 4.6. It provides the most attractive balance between cost, quality, and practical versatility.
Primary choice: Claude Opus 4.6. Use it when failure costs are higher than model costs.
Primary choice: Gemini 3 Pro. Especially relevant for mixed media, large context, and broad reasoning-heavy workloads.
Primary choice: Claude Sonnet 4.6. Step up to Opus 4.6 only when task difficulty clearly justifies the premium.
Short descriptions of what each evaluation is actually trying to measure.
Primary sources were preferred. Aggregated metadata was used cautiously to normalize overlapping rows and reduce repetitive extraction work.
The aggregator was used as a normalization aid, not as an unquestioned authority. Where an official source and an aggregated summary diverged, the official source or an explicit caveat was preferred.
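As a hedged illustration of that reconciliation step (field names, the tolerance, and the handling of rows with no primary source are assumptions for this sketch, not the report's actual tooling):

```ts
// Illustrative sketch: aggregated metadata only confirms an official number.
// When the two diverge beyond a small tolerance, the official value is kept
// and the row is flagged so an explicit caveat can be attached.
type SourcedScore = { official: number | null; aggregated: number | null };

function reconcile(entry: SourcedScore, tolerancePts = 0.1): { value: number | null; caveat: boolean } {
  if (entry.official === null) {
    // No primary source available: shown as n/a here rather than falling back
    // to the aggregator alone (an assumption made for this illustration).
    return { value: null, caveat: false };
  }
  const diverges =
    entry.aggregated !== null && Math.abs(entry.official - entry.aggregated) > tolerancePts;
  return { value: entry.official, caveat: diverges };
}
```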