Google Antigravity 2.0 vs Cursor vs Windsurf vs Claude Code: Is Your IDE Now a Fleet of Agents?

Google shipped Antigravity 2.0 at I/O on May 19, and the pitch is bigger than another autocomplete upgrade. The editor isn’t the product anymore. The product is a dashboard where you dispatch a handful of agents, watch them plan, code, spin up a browser to test their own work, and come back with video proof — while you sit in the role of someone reviewing pull requests rather than typing them.

That’s a genuinely different way to work, and it’s worth taking seriously. It’s also why the question “should I switch from Cursor?” doesn’t have a clean answer right now. The launch was a mess for a lot of existing users, and the tool you’d be switching to is partly a promise. So let me lay out what actually changed, what each of the four main options is good at, and where I’d point you depending on how you like to code.

What “agent-first” actually means, beyond the marketing

The phrase gets thrown around loosely, so here’s the distinction that matters day to day.

The old model — Cursor in its early days, Copilot, the original Antigravity — is AI in the editor. You’re driving. The model rides shotgun, finishes your line, answers a chat question, edits a file you point it at. You stay in the loop on every keystroke-sized decision.

Antigravity 2.0 flips that. Its “Agent Manager” is a mission-control surface where you describe a task and the system spawns subagents that work concurrently across your codebase. One might be refactoring an API while another writes the tests and a third opens Chrome to click through the UI and confirm nothing broke. You’re not watching a cursor move. You’re reviewing artifacts after the fact — including short screen recordings the agents produce as evidence that the feature works.
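
To make the shape of that workflow concrete, here is a minimal sketch of the dispatch-then-review pattern using Python's asyncio. It is not Antigravity's API. `Artifact`, `run_subagent`, and `dispatch` are names I made up for illustration, and the sleep calls stand in for real agent work.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Artifact:
    """What a subagent hands back for review (hypothetical shape)."""
    task: str
    summary: str


async def run_subagent(task: str, seconds: float) -> Artifact:
    # Stand-in for real agent work: the sleep simulates planning, coding, testing.
    await asyncio.sleep(seconds)
    return Artifact(task=task, summary=f"done: {task}")


async def dispatch(tasks: dict[str, float]) -> list[Artifact]:
    # Start every subagent at once, then wait for all of them to finish.
    # gather() returns results in the order the jobs were passed in,
    # regardless of which one finished first.
    jobs = [run_subagent(name, secs) for name, secs in tasks.items()]
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    artifacts = asyncio.run(dispatch({
        "refactor API": 0.03,
        "write tests": 0.02,
        "click through UI": 0.01,
    }))
    for artifact in artifacts:
        print(artifact.summary)
```

The point of the sketch is the division of labor: the human never appears inside `run_subagent`, only at the end, reading what came back.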

Claude Code sits in a third spot that people lump in with the others but shouldn’t. It’s terminal-native and agentic, but it’s deliberately on a leash. It plans, runs commands, edits across files, and checks its own work, yet the whole thing happens in a transcript you can read top to bottom. It’s agentic without pretending you’re a manager: more like an extremely capable engineer narrating every move in your terminal.

So when someone says “agent-first,” ask which flavor. Parallel fleet you supervise (Antigravity), in-flow pair that you drive (Cursor, Windsurf), or auditable single-threaded agent in the terminal (Claude Code). They’re solving different anxieties.

The four contenders, by what they’re genuinely best at

Google Antigravity 2.0 is the most ambitious of the group. It’s now a five-surface suite: the desktop app with Agent Manager, a CLI, an SDK, Managed Agents in the Gemini API, and enterprise hooks through Google’s agent platform. It runs on Gemini 3.5 Flash, which Google says is roughly four times faster than other frontier models — and speed matters a lot when you’re running several agents at once and waiting on all of them. The browser-driven testing with video artifacts is the standout feature; I haven’t seen anyone else make “the agent proves it works” this central. Google quoted a 76.2% SWE-bench Verified figure at the keynote, though independent SWE-Bench Pro numbers tell a more modest story (more on benchmarks below).

Cursor is still the one I’d hand to someone who wants to feel fast immediately. It’s the in-flow coding experience that nailed the feel of editing alongside a model, and Composer 2.5 — which shipped May 18, a day before Antigravity — added cloud agent environments, a Build in Parallel mode, and Teams integration. So Cursor isn’t sitting still on the parallel-agent front. If your work is mostly hands-on, tight-loop coding where you want to stay in the file, this is home.

Windsurf earns its keep on large, messy codebases. Its Cascade context retrieval is the thing people who work in big monorepos keep coming back to — it tends to pull the right surrounding code without you spoon-feeding it paths. The Pro plan went from $15 to $20/month in May, and there’s now a $200/month Max tier that bundles Devin Cloud and the Devin CLI, which is a real signal about where they’re aiming (heavier, more autonomous work).

Claude Code is what I reach for when correctness and reviewability matter more than flash. Large refactors, security passes, anything where I want to read exactly what changed and why. Anthropic doubled the 5-hour usage limits across Pro, Max, Team, and Enterprise in May, which eased the single biggest complaint about it. It’s also model-locked to Claude in a way the others aren’t, and Claude Opus 4.7 posts the strongest production-code-quality numbers of the bunch (around 64.3% on SWE-Bench Pro), so that lock-in is less of a sacrifice than it sounds.

If you want the head-to-head on the models themselves rather than the tools, I went deep on that in “Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro” and on the cheaper-frontier angle in “Cursor Composer 2.5 vs Opus 4.7.”

The launch reality nobody at Google wants in the headline

Here’s the part the vendor blog won’t tell you straight. The 2.0 rollout hurt a chunk of existing users badly.

The update pushed automatically on launch day. For people already running the original Antigravity, it removed the built-in code editor from their environment, wiped stored configurations, and left them staring at a broken setup with no obvious route to the new CLI. Imagine opening your IDE on a Tuesday morning to find the editor gone and your settings cleared. That happened, at scale, to early adopters — the exact people most inclined to evangelize the tool.

The free tier also got gutted. It dropped from 250 requests a day to 20. Twenty requests is enough to look around and not much else, which makes “just try it for a week before committing” a lot harder than it was.

None of this means the underlying platform is bad. The parallel-agent architecture is real and the browser-testing idea is genuinely good. But a launch like this tells you something about maturity: the orchestration is ahead of the operational polish. That gap usually closes in a couple of months. It just hasn’t closed yet.

Where the decision actually gets made: how much autonomy do you want?

Strip away the feature lists and this is the real axis. Not “which is most powerful” — which level of control matches how you work.

If you like keeping your hands on the wheel, editing in the file, steering every few seconds, then a fleet of agents working out of sight will feel less like a superpower and more like losing track of your own codebase. Cursor and Windsurf respect that instinct. You’re still the one coding; the AI is amplifying you.

If your bottleneck is that you have five independent tasks and only one of you, the Antigravity model is compelling. Kick off five agents, go review the first one’s video while the others grind, merge what passes. The catch: this only pays off if reviewing agent output is faster for you than doing the work. For some tasks it absolutely is. For subtle, architecture-heavy changes, reviewing a confident agent’s diff can be slower and more dangerous than writing it yourself — because you have to reconstruct reasoning you didn’t witness.
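
That catch is easy to put numbers on. Below is a rough sketch of the break-even, under one assumption of mine: if review rejects the agent's result, you end up doing the task yourself on top of the review time you already spent. The function name and the example figures are hypothetical. Plug in your own.

```python
def minutes_saved(do_it_yourself: float, review: float, rework_chance: float = 0.0) -> float:
    """Expected minutes saved by delegating one task to an agent.

    Assumes a rejected result costs the review time plus the full
    do-it-yourself time. A negative return value means delegating loses.
    """
    if not 0.0 <= rework_chance <= 1.0:
        raise ValueError("rework_chance must be between 0 and 1")
    expected_delegated = review + rework_chance * do_it_yourself
    return do_it_yourself - expected_delegated


# Routine task: 60 min by hand, 10 min to review, rejected 20% of the time.
routine = minutes_saved(60, 10, 0.2)

# Architecture-heavy task: 90 min by hand, 70 min to review, rejected 40% of the time.
subtle = minutes_saved(90, 70, 0.4)
```

With those made-up inputs the routine task saves about 38 minutes and the subtle one loses about 16. The model is crude, but it shows why the same tool can feel like a gift on one ticket and a tax on the next.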

And if you need to be able to explain every change in a code review or a compliance context, Claude Code’s readable transcript is hard to beat. The whole session is a document. That auditability is underrated until the day someone asks “why did this line change” and you can actually answer.

The “video proof” idea deserves a flag here too. It’s clever, but watching a 30-second recording of a UI working is not the same as knowing the code is correct. It catches the obvious breakage. It won’t catch the race condition that fires one time in fifty. Treat the artifact as a smoke test, not a sign-off.
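
The one-in-fifty figure makes for a quick back-of-the-envelope check. The sketch below assumes each run is independent and fails with the same probability, which real race conditions rarely honor exactly, so read the result as an order of magnitude. Both function names are mine.

```python
import math


def chance_of_catching(failure_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs hits the failure."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of independent runs that catches the failure
    with probability of at least `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


one_recording = chance_of_catching(1 / 50, 1)  # about 0.02
for_95_percent = runs_needed(1 / 50, 0.95)     # 149
```

Under those assumptions a single recording has about a 2% chance of showing the bug, and you would need 149 clean runs before you could be 95% confident it isn't there.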

Pricing at the frontier tier, and what you’re locking into

The serious plans have converged around $200/month, with Google having just shaken up the middle.

Google AI Ultra now has a $100/month tier aimed squarely at developers — 5× the Pro usage limit in Antigravity plus priority access and Gemini 3.5 Flash — and Google cut its top Ultra plan from $250 to $200 at the same time. That $100 middle option is the most interesting pricing move of the month; nobody else has a frontier-grade developer tier there.
Cursor runs Pro+ at $60/month and Ultra at $200/month.
Windsurf sits at $20/month for Pro and $200/month for Max (with the Devin bundle).
Claude Code is included in Claude’s Pro and Max subscriptions, with API usage billed separately if you go that route.
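
For budgeting, the same list prices annualize as follows. This is a small sketch using only the monthly figures quoted above. It assumes month-to-month billing with no annual discount, leaves out Claude Code because its subscription price isn't listed here, and the plan labels are my own shorthand.

```python
# Monthly list prices in USD, as quoted in this article.
MONTHLY_USD = {
    "Google AI Ultra (developer tier)": 100,
    "Google AI Ultra (top plan)": 200,
    "Cursor Pro+": 60,
    "Cursor Ultra": 200,
    "Windsurf Pro": 20,
    "Windsurf Max": 200,
}


def annual_cost(monthly_usd: int, seats: int = 1) -> int:
    """Yearly spend for a plan, assuming no annual-billing discount."""
    return monthly_usd * 12 * seats


if __name__ == "__main__":
    for plan, price in MONTHLY_USD.items():
        print(f"{plan}: ${annual_cost(price):,} per seat per year")
```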

Lock-in is the quieter cost. Antigravity ties you to Gemini models and Google’s surfaces. Claude Code ties you to Claude. Cursor and Windsurf let you bring different models, which is worth real money if you like switching based on the task — something I argued for in Gemini 3.5 Flash vs GPT-5.5 vs Claude Opus 4.7: when to switch. If you believe one lab will stay ahead forever, lock-in is fine. I don’t believe that, so I value the flexibility.

A word on those benchmark numbers

Take the SWE-bench figures with a grain of salt. Google’s 76.2% Verified claim and Claude’s ~64.3% SWE-Bench Pro number aren’t measuring the same thing on the same set, and Pro is the harder, more realistic benchmark. Flash’s SWE-Bench Pro score (around 55%) trails the leaders even as it wins on raw speed. The honest read: Antigravity’s edge is throughput and orchestration, Claude’s edge is per-task code quality, and the gap between a keynote slide and your actual repo is wide. Benchmarks pick the winner of a contest you’re not entering.

So who should use what

For solo, in-flow coding where you want to stay in the file and feel fast — Cursor. It’s still the smoothest hands-on experience, and Composer 2.5 quietly added parallel work if you grow into it.

For large monorepos and enterprise codebases where context retrieval is the real pain — Windsurf. Cascade pulls its weight, and the Devin bundle at the Max tier hints at heavier autonomy if you want it.

For refactors, security work, and anything you’ll have to defend in review — Claude Code. The readable transcript and Opus 4.7’s code quality are the combination I trust most when the change matters.

For genuinely parallel, multi-task work where you’re comfortable supervising rather than typing — Antigravity 2.0, with one caveat. Given the launch, I’d wait 60 to 90 days unless you’re an existing Google-stack shop with appetite for rough edges. The architecture is the most forward-looking of the four. The polish needs another release or two.

If you’ve got a spare afternoon, the cheapest experiment is to take one real task — not a toy — and run it through both Claude Code and Antigravity’s free tier, then compare how long reviewing each result took you. That number, not the benchmark, is the one that tells you which future you actually want.

Sources: TechCrunch, MarkTechPost, ChatForest review, Google blog — AI subscriptions, NxCode pricing comparison