Claude Sonnet 5 Review 2026: vs GPT-5.5, Gemini 3.5 Flash

Anthropic dropped Claude Sonnet 5 on June 30, and by the next morning it was already the default model in claude.ai, Claude Code, Cursor, VS Code, and GitHub Copilot. If you write code with an AI assistant, you were probably using it before you’d read a single benchmark. That’s the interesting part of this launch — the switch happened to you, not the other way around.

So the real question isn’t “is Sonnet 5 good.” It clearly is. The question is whether it changes what you should actually be running for a given job, and what it’s going to cost you once the honeymoon pricing ends on August 31. I’ve spent the last few days pushing it through agentic coding tasks, and there’s a more interesting story here than the headline “cheaper model, near-Opus quality” summary suggests.

What Sonnet 5 actually is

Sonnet 5 is Anthropic’s mid-tier model, sitting below Opus 4.8 and above Haiku in the lineup. The two things that matter about this release: it now ships with a 1M-token context window, and it lands close enough to Opus 4.8 quality that the price gap between them starts to feel absurd for a lot of work.

The model ID is claude-sonnet-5, and it’s a straight drop-in wherever you were calling Sonnet 4.6. No new API surface, no migration headache. Anthropic clearly wanted zero friction on adoption, which is why it went out as the default across every major coding surface on day one instead of hiding behind a preview flag.

The framing Anthropic is pushing is “most agentic Sonnet yet.” That’s marketing, but it’s pointing at something real. This model is tuned for tool use and multi-step work — the kind of long agent loops where the model reads files, runs commands, checks output, and keeps going. More on where that helps and where it doesn’t in a second.

The benchmarks that matter (and the ones that don’t)

Every launch post is going to throw a wall of benchmark numbers at you. Most of them don’t help you decide anything. Here are the ones I’d actually look at, with the competitors lined up:

Benchmark	Sonnet 5	GPT-5.5	Gemini 3.5 Flash	Opus 4.8
SWE-bench Pro	63.2%	58.6%	55.1%	69.2%
Terminal-Bench 2.1	80.4%	83.4%	—	74.6%
OSWorld-Verified	81.2%	—	—	83.4%
GDPval-AA v2 (Elo)	1,618	—	—	1,615

A few things jump out. On SWE-bench Pro — resolving real GitHub issues in real repos — Sonnet 5 beats both GPT-5.5 and Gemini 3.5 Flash by a comfortable margin. If your mental model is “the frontier flagship always wins on coding,” this launch breaks it. A mid-tier model is now out-coding a competitor’s flagship on the benchmark that most resembles day-to-day engineering.

But look at Terminal-Bench 2.1, which measures driving a terminal through a task. GPT-5.5 takes it, 83.4% to 80.4%. And here’s the odd one: Sonnet 5 actually beats Opus 4.8 on that same terminal benchmark (74.6%), while losing to Opus on SWE-bench Pro. So “which Anthropic model is better at coding” doesn’t even have a clean answer — it depends on whether the task is code-reasoning-heavy or orchestration-heavy.

The GDPval knowledge-work Elo is the number I’d point a skeptic at. At 1,618 it edges Opus 4.8’s 1,615. That’s noise-level identical, on a benchmark meant to capture real professional output rather than puzzle-solving. For a model that costs a fraction of Opus, matching it on general knowledge work is the whole pitch in one line.

What I’d ignore: any single-percentage-point difference on a benchmark you’ve never run yourself. Benchmark deltas under a couple of points don’t survive contact with your actual codebase and your actual prompts.

Pricing: read the calendar, not just the sticker

Here’s where you need to pay attention, because there are two prices and the one you see today isn’t the one you’ll pay in September.

Sonnet 5, now through Aug 31: $2 input / $10 output per million tokens
Sonnet 5, from Sept 1: $3 input / $15 output
GPT-5.5: $5 / $30
Gemini 3.5 Flash: $1.50 / $9
Opus 4.8: $5 / $25

Anthropic set the intro pricing so that moving off the old Sonnet is roughly cost-neutral, then it steps up 50% on September 1. That’s a normal launch tactic, but it means any budget math you do this week is optimistic. Model your spend at the $3/$15 rate, not the promo rate, or you’ll get a surprise in your September invoice.

At the post-promo price, the comparison gets genuinely interesting. Sonnet 5 undercuts GPT-5.5 by roughly half on both input and output while beating it on SWE-bench Pro. Against Opus 4.8 — its own sibling — you’re paying $3/$15 instead of $5/$25 for output that matches on knowledge work and comes within six points on hard coding. For most teams that’s not a close call anymore.

Gemini 3.5 Flash is still the cheap seat at $1.50/$9, and if your workload is high-volume and latency-sensitive, that gap matters. But it’s the weakest of the three on SWE-bench Pro by a clear margin. You’re trading real coding capability for the lower bill.

Agentic coding: “best coder” is the wrong question

I keep seeing “is Sonnet 5 the best AI coder now” as the framing, and it’s the wrong axis. There are at least three different things people mean by that, and Sonnet 5 sits differently on each.

If you mean coding depth — untangling a gnarly bug across a big unfamiliar codebase, reasoning about a subtle race condition — Opus 4.8 is still ahead on SWE-bench Pro, and I felt that difference on the hardest tasks I threw at both. Sonnet 5 is close, not equal. When I’m genuinely stuck, I still reach for Opus.

If you mean tool orchestration speed — an agent grinding through a long loop of read-edit-run-check — Sonnet 5 is the sweet spot. It’s fast, it’s tuned for exactly this, and at a third of Opus’s output price you can let it run longer loops without wincing at the token counter. This is where the “most agentic Sonnet” claim earns its keep. Most of my day-to-day Claude Code sessions are this shape, and Sonnet 5 has quietly become the right default for them.

If you mean cost per solved task at volume, the answer depends entirely on your task mix and you should measure it, not trust a table. A cheap model that fails and retries twice isn’t cheap.

The honest read: Sonnet 5 doesn’t dethrone anything. It compresses the gap. The distance between “the model you can afford to run all day” and “the model you save for hard problems” just got a lot smaller, which changes how you’d route work between them. If you’re setting up that kind of routing, the same logic I laid out in the model routing guide still applies — Sonnet 5 just moves up as the new default tier.

Where the 1M context helps (and where it’s a trap)

The million-token window is the spec everyone repeats, so let me be blunt about it. Big context is useful for a narrow set of things and oversold for everything else.

It genuinely helps when you need the model to hold a large codebase or a long document set in view at once — reviewing a sprawling PR against the whole module it touches, or answering questions across a big spec without you hand-picking the relevant files. That’s real, and it’s a workflow Sonnet 4.6 couldn’t do without chunking.

Where it’s a trap: dumping your entire repo into context and hoping the model sorts it out. Recall degrades across very long contexts for every model, Sonnet 5 included, and you’re paying input tokens for the privilege of worse attention. A well-targeted 30k-token prompt beats a lazy 800k-token one almost every time. Big context is a tool for when you actually need the breadth, not a substitute for giving the model the right files.

Should you switch? A decision by use case

Here’s how I’d actually decide, by what you’re doing:

You’re already in Claude Code, Cursor, or Copilot on Sonnet. You’ve already switched — it’s the default. Just be aware of the September price bump and check your usage patterns before then. Nothing to do today.

You’re running Opus 4.8 for everyday coding. Try demoting to Sonnet 5 as your default and keep Opus for the hard stuff. For most sessions you won’t feel the difference, and you’ll roughly halve your output-token bill. This is the switch with the clearest payoff.

You’re on GPT-5.5 for agentic coding. Worth a serious look. Sonnet 5 beats it on SWE-bench Pro and costs about half. GPT-5.5 still edges it on terminal-driving tasks, so if that’s your core loop the case is weaker — but for issue-resolution work, the numbers favor a switch. Test on your own tasks before committing a team.

You’re high-volume and cost-obsessed. Gemini 3.5 Flash at $1.50/$9 is still cheaper, and if your work is simpler and latency-sensitive it may be the right call. But benchmark the failure-and-retry rate, not just the sticker price — Sonnet 5’s higher success rate can make it cheaper per completed task.

You need long-context review or analysis. Sonnet 5’s 1M window plus its price makes it the obvious pick over Opus for this. Just don’t confuse “can hold a lot” with “should be handed everything.”

The catch nobody’s leading with

Two things keep this from being a clean victory lap.

First, that price step on September 1 is a 50% jump, and it’s easy to build a workflow at $2/$10 economics that doesn’t pencil out at $3/$15. If you’re standing up something new on Sonnet 5 right now, do the math at the higher number.

Second, “near-Opus quality” is doing some work in the marketing. On knowledge work, sure, it’s a wash. On the hardest coding tasks it’s six points back on SWE-bench Pro, and those six points are exactly the tasks where you most want the model to be right. Sonnet 5 is the better default. It’s not a reason to delete Opus from your toolbox.

If you’ve got Claude Code open, the useful experiment is cheap: take the last genuinely hard bug you solved with Opus, replay it on Sonnet 5, and see whether you’d have known the difference. If you wouldn’t have, you just found your new default and cut your bill. If you would have, now you know exactly where the line is — which is worth more than any benchmark table.

Sources: Anthropic — Introducing Claude Sonnet 5, MarkTechPost — Sonnet 5 vs Sonnet 4.6 vs Opus 4.8, DataCamp — Claude Sonnet 5, GPT-5.5 API pricing, Gemini 3.5 Flash pricing, Claude Opus 4.8 pricing