Async Cloud Coding Agents 2026: Vibe, Devin, Codex, Cursor

May 4, 2026
12 min read

The interesting part of “AI for coding” stopped being autocomplete a while ago. The interesting part now is that you can fire off a task from your phone, walk away, and come back to a draft pull request. That’s a different product. It’s also a different bill.

On April 29, Mistral dropped Vibe Remote Agents powered by Medium 3.5, and the async-cloud-agent category officially has six or seven serious players instead of three. That’s enough of a shift to stop and look at what’s actually shipping, what each one costs, and which one is worth standardizing on if you’re not just running solo experiments.

This post is about async coding agents — the cloud kind. The in-IDE assistants you talk to in real time (Cursor’s chat, Claude Code in your terminal, Copilot’s tab completion) are a different category. There’s a separate post on the blog comparing those. The line is roughly: if it streams tokens into your editor, it’s a sync assistant; if it spawns a sandbox in the cloud and pings you when the diff is ready, it’s an async agent.

Why “async cloud” became the strategic category in 2026

The shift happened because two things finally got cheap enough at the same time: long-context models that can actually keep a whole repo in their heads, and ephemeral cloud sandboxes you can throw away when the task ends.

Once both of those exist, the natural product isn’t “AI helps me type.” It’s “I describe the task, the agent goes off, I get a PR.” You can run five or eight of those in parallel against the same repo. Your laptop doesn’t get hotter. You can be on a flight while three of them are running tests.

The catch is that async agents are way more expensive per task than sync assistants — and a lot more variable. A Cursor tab completion costs a fraction of a cent. An async agent run can burn $5–$15 in compute and model tokens before it lands a PR, and if it fails, you paid for the failure. Whether that math works depends entirely on whether the merged PR rate is good enough that the per-merge cost beats a junior engineer’s hourly rate. As of mid-2026, for the right tasks, it does. For the wrong tasks, it’s still cheaper to write the code yourself.
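As a rough sketch of that break-even (the $10 per attempt sits in the middle of the range above; the 50% merge rate and the $60/hour junior rate are illustrative assumptions, not vendor numbers):

```python
# Break-even sketch: async agent vs. writing the code yourself.
# The $10/attempt figure is mid-range from the numbers above; the 50% merge
# rate and the $60/hour junior rate are illustrative assumptions.

def cost_per_merged_pr(cost_per_attempt: float, merge_rate: float) -> float:
    """Expected spend per PR that actually merges, counting failed runs."""
    return cost_per_attempt / merge_rate

agent = cost_per_merged_pr(cost_per_attempt=10.0, merge_rate=0.5)  # $20.00
human = 2.0 * 60.0  # a two-hour task at $60/hour of engineer time

print(f"agent: ${agent:.2f} per merged PR")
print(f"human: ${human:.2f} for the same task")
# The agent wins whenever cost_per_attempt / merge_rate stays under what the
# task would cost in engineer time, which is why merge rate dominates the math.
```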

What Mistral shipped on April 29

Vibe Remote Agents are the visible product. Medium 3.5 is the model behind them. Both shipped together.

Medium 3.5 is a 128B dense multimodal model with a 256k context window, distributed under a modified MIT license — meaning the weights are actually downloadable from Hugging Face, not just available behind an API. Mistral reports 77.6% on SWE-Bench Verified, which puts it within striking distance of the closed frontier coding models on the same benchmark. API pricing is $1.50 per million input tokens and $7.50 per million output, roughly half of what hosted frontier coding APIs charge.

Vibe is the agent shell. You spawn a session from a CLI, from Le Chat, or from the new Work mode UI, and it runs in Mistral’s cloud sandbox. The trick that nobody else is doing yet: you can teleport a local CLI session up to the cloud, and pull a cloud session back down to your laptop. So if you start something locally and it turns into a long-running task, you don’t have to restart — you push it up, the same conversation continues remotely, and you get the PR notification later.

The combination matters more than either piece alone. Open-weight models existed. Cloud agent sandboxes existed. But “open-weight model + EU-hosted async agent + cloud↔local teleport” is a real differentiator for European teams that have been politely declining to put their codebase into a US frontier API. I’d expect to see a wave of self-hosted Vibe deployments by late summer.

The honest limitations: Vibe’s sandbox tooling around test runners, secrets, and CI hooks is newer than what Devin or Codex Cloud has built. If your project depends on a complicated test harness, you’ll spend more time wiring it up. And the SWE-Bench number, while strong, is not the same as the workplace number — more on that below.

The contenders, briefly

Seven products are worth taking seriously in this category right now. Here’s what each one actually is, stripped of marketing.

Cognition Devin. The original async SWE agent, the one that defined the category in 2024. The big news from this year is that Devin 2.0 dropped the entry price from $500/month to $20/month plus $2.25 per ACU (Agent Compute Unit, roughly 15 minutes of active work). The Team plan is still $500/month with 250 ACUs included. Cognition also raised at a $25B valuation and integrated Windsurf this spring. Devin’s edge is the longest track record of running unattended on real codebases — they’ve had two years to fix the failure modes everyone else is just discovering.

OpenAI Codex Cloud. Re-launched in 2025 as a cloud agent that works on many tasks in parallel inside its own sandbox, preloaded with your repo. It’s wired into ChatGPT (available on Plus), the dedicated Codex app, and the IDE extensions. The sandbox is open-source-configurable, with a default policy that the agent can only edit files in its branch and has to ask before doing anything with elevated permissions. The Codex app itself is the cleanest async-agent UX I’ve used — it’s clearly built around the assumption that you’re managing a fleet of agents, not chatting with one.

Cursor Background Agents. Cursor’s bet on async, included on the Pro plan at $20/month. You can run up to eight in parallel. They clone your repo, work autonomously, and open a PR when done. The pitch is: same Cursor account, same model routing, same context engine you already trust from the editor — just running headless in the cloud. If you’re already paying for Cursor Pro, the marginal cost of trying async is basically zero.

GitHub Copilot Workspace. The Microsoft answer, now generally available inside Copilot Enterprise and Business with model choice across Claude and GPT. The selling point is GitHub-native: it lives where your issues, PRs, and Actions already live, so the integration story is short. The knock against it is that the agent feels less aggressive than Devin’s or Cursor’s: more “drafts a plan, asks for your approval, then drafts code” than “goes off and tries to ship.”

Replit Agent 3. Strongest if your work happens inside Replit itself — the agent has unfair advantages on environment setup and deploy, because the dev environment is the platform. Less compelling if you’re trying to point it at a GitHub repo with its own toolchain.

Sourcegraph Amp. The enterprise pick. Amp’s edge is that Sourcegraph’s codebase indexing has been the gold standard for understanding large monorepos for years, and Amp inherits all of that. If you’re running an agent against a 5M-LOC repo where context is the bottleneck, Amp is in a different conversation than the others.

Mistral Vibe. Covered above. The newcomer with the open-weight + EU-hosting + teleport story.

The SWE-Bench scoreboard, with a warning

SWE-Bench Verified scores as of early May 2026, from vendor-published numbers where available:

  • Mistral Medium 3.5 — 77.6%
  • Anthropic Claude Opus 4.7 (which several of these agents use) — reportedly in the high 80s
  • GPT-5.5 (powering Codex Cloud) — competitive with Opus on coding benchmarks
  • Devin 2.0 — Cognition has not published a fresh SWE-Bench Verified score; the original ~14% figure is long stale, and earlier reports place updated runs in the 75–78% range
  • Cursor — comparable to Copilot in the 50s on certain benchmark splits, though the Background Agent path uses frontier models that score considerably higher

The warning is the same one researchers have been giving for two years: SWE-Bench Verified is a closed set of GitHub issues from a handful of Python repos. A 77.6% there does not mean the agent will close 77.6% of your tickets. The benchmark rewards short, well-scoped tasks with a clear test signal, which is the easy half of real engineering work. Treat SWE-Bench scores as a floor — “this agent has minimum competence” — not as a ranking of what’ll actually merge in your repo.

The number that would actually matter, “merged-PR rate on real proprietary codebases,” is one nobody publishes for obvious reasons.

The cost math nobody puts in their pricing page

Subscription pricing tells you almost nothing about what these agents cost in production. Here’s what to actually budget for.

Devin Core ($20/mo + $2.25/ACU). An ACU is roughly 15 minutes of agent work. A non-trivial task — say, a small feature with tests — runs 2–6 ACUs in my experience, so $4.50–$13.50 per attempt. If you’re getting a 50% merge rate, your effective cost per merged PR is $9–$27. That’s cheap for anything a junior would have spent half a day on. It’s expensive for one-line bug fixes.

Cursor Background Agents (included in Pro $20/mo). “Included” is doing a lot of work in that sentence. Background agents draw from your $20/month of model credits, and a single complex task can eat $2–8 of that. Heavy users land on Pro+ ($60) or Ultra ($200) within a month. Honest pricing if you treat the $20 as a starting balance, not a subscription.

Codex Cloud (ChatGPT Plus, $20/mo). Bundled in if you already pay for ChatGPT, which a lot of people do. The opportunity cost is that you’re using your ChatGPT quota on coding tasks instead of writing or research. For teams, the Codex app pricing is per-seat with usage limits.

Mistral Vibe (API-priced at $1.50/$7.50 per M tokens). The cheapest per-token of the frontier-class options. A typical agent task is 50–200k input tokens and 5–30k output, which lands at roughly $0.10–$0.60 per task in raw model cost. That’s before sandbox compute, which Mistral hasn’t fully priced out yet for Vibe specifically.
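In code, the back-of-envelope for raw model cost per task at those list prices (token counts are the rough ranges above, not measurements):

```python
# Raw model cost per agent task at Medium 3.5 list prices.
# Token counts are the rough per-task ranges cited above, not measurements.
INPUT_PRICE_PER_TOKEN = 1.50 / 1_000_000    # $1.50 per million input tokens
OUTPUT_PRICE_PER_TOKEN = 7.50 / 1_000_000   # $7.50 per million output tokens

def task_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

low = task_cost(50_000, 5_000)      # about $0.11
high = task_cost(200_000, 30_000)   # about $0.53
print(f"raw model cost per task: ${low:.2f} to ${high:.2f}")
# Sandbox compute comes on top of this and isn't priced yet for Vibe.
```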

Devin Team ($500/mo, 250 ACUs). The honest math here is that you’re paying $2/ACU instead of $2.25, plus you get team features. If you’re not burning through 200+ ACUs/month across your team, you’re overpaying.
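The break-even behind that advice, as a quick sketch (it ignores Team-only features and whatever overage pricing looks like past the included 250 ACUs):

```python
# Devin Core ($20/mo + $2.25/ACU) vs. Team ($500/mo with 250 ACUs included).
# Sketch only: ignores Team-only features and overage past the included ACUs.
CORE_BASE, CORE_PER_ACU = 20.0, 2.25
TEAM_FLAT, TEAM_INCLUDED_ACUS = 500.0, 250

breakeven = (TEAM_FLAT - CORE_BASE) / CORE_PER_ACU  # about 213 ACUs/month
print(f"Core overtakes Team at about {breakeven:.0f} ACUs/month")

for acus in (100, 200, 213, 250):
    core = CORE_BASE + CORE_PER_ACU * acus
    print(f"{acus:>3} ACUs  Core=${core:7.2f}  Team=${TEAM_FLAT:.2f}")
```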

The hidden cost on every one of these is the failure tax: the agent runs for 20 minutes, can’t finish, and you eat the compute. The vendors with the best sandbox isolation also tend to have the highest failure tax because they’re more cautious about asking before doing risky things. There’s no free version of this trade-off.

Sandbox, secrets, and the long-test-suite problem

The single most underrated dimension is how each agent handles your secrets and your slow test suite.

Codex Cloud’s sandbox model is the most explicit: agents can only touch files in their branch, network access requires approval, and secrets are injected per-task via OpenAI’s secrets manager. This is the right model for regulated environments, and it’s also the one where you’ll most often see “agent stopped to ask permission” interruptions.
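For a sense of the shape of that policy, here is a purely illustrative sketch; the names and structure are hypothetical, not OpenAI’s actual configuration or API:

```python
# Purely illustrative sketch of a branch-scoped sandbox policy like the one
# described above. Names and structure are hypothetical, not OpenAI's API.
from dataclasses import dataclass, field

@dataclass
class SandboxPolicy:
    branch: str                                             # the only branch the agent may edit
    secrets: dict[str, str] = field(default_factory=dict)   # injected per task
    approval_required: frozenset = frozenset({"network", "install_package", "push"})

    def can_edit(self, current_branch: str) -> bool:
        # File edits are confined to the agent's own branch.
        return current_branch == self.branch

    def needs_human_approval(self, action: str) -> bool:
        # Elevated actions pause the run and ask before proceeding.
        return action in self.approval_required

policy = SandboxPolicy(branch="agent/fix-flaky-test", secrets={"CI_TOKEN": "<injected>"})
print(policy.can_edit("agent/fix-flaky-test"))   # True
print(policy.needs_human_approval("network"))    # True
```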

Cursor Background Agents inherit Cursor’s existing repo permissions, which is fast to set up but means an agent has the same blast radius as your local Cursor install. Convenient. Probably fine for most teams. Not what your security review will approve for a financial services repo.

Devin runs in Cognition’s managed VMs and has the most mature handling of long-running tests — it’s been tuned for years on real customer codebases where npm test takes 20 minutes. The trade-off: you’re trusting Cognition’s infrastructure with your repo, which is a meaningful corporate-policy question.

Mistral Vibe’s sandbox is EU-hosted, which is the entire reason GDPR-conscious teams will look at it. Mature it isn’t, yet — but the regulatory positioning is real, and the fact that the model weights are downloadable means you can in principle run the whole thing on your own infrastructure if Mistral’s hosted version doesn’t pass procurement.

Copilot Workspace runs inside GitHub Actions, which means your existing repo permissions, your existing secrets management, and your existing CI minutes all apply. For teams already on Copilot Enterprise this is a one-line decision. For everyone else, it forces you to live in GitHub’s runner pricing.

When async is the wrong answer

Async cloud agents are not the right tool when:

  • The task is small enough that explaining it takes longer than doing it. Telling Devin to fix a typo is silly. Just fix the typo.
  • You need to iterate visually (UI tweaks where the feedback loop is “look at it”). A sync editor with hot reload still wins.
  • The task requires reading subtle context that doesn’t live in the repo (Slack threads, design docs, the tribal knowledge in someone’s head). Async agents can’t ask your CTO a clarifying question in real time.
  • You can’t stomach the failure rate. If 1-in-3 PRs being garbage is a problem for your team’s review bandwidth, the agent is making the team slower, not faster.

The break-even point I keep landing on personally: async agents are worth it for tasks I’d estimate at 30 minutes to a few hours of focused work, where I can write a clear spec in two paragraphs and the test signal is unambiguous. Outside that window, sync tools or just typing the code remain faster.

The pick, by team shape

For a 5-engineer startup already on Cursor: turn on Background Agents this week. The marginal cost is your existing $20/month, and the only commitment is ticking a checkbox. If it pays off, upgrade a tier; if it doesn’t, you’ve lost nothing.

For a 50-engineer scale-up: Codex Cloud if you’re a heavy ChatGPT shop, Devin Team if you want the most mature unattended-agent product and are willing to budget $500/month plus overage. Either is defensible. I’d lean Codex Cloud right now because the Codex app’s multi-agent UX is genuinely a step ahead.

For a regulated enterprise (healthcare, finance, EU public sector): Mistral Vibe is the only one with a credible answer to “we can’t ship our code to a US-hosted frontier API.” If you’re not regulated, Sourcegraph Amp is still the thinking-engineer’s pick for any monorepo over a million lines.

For a solo developer who wants to try the category without a commitment: Codex Cloud bundled into ChatGPT Plus is the lowest-friction starting point. You’re already paying for ChatGPT.

One thing to try this week

If you’ve never run an async agent end to end, pick the smallest real ticket in your backlog — something you’ve been ignoring for two weeks because it’s annoying — and hand it to whichever agent is already in your stack. Don’t pick a hard task. Pick a boring one. Watch what happens.

The category’s whole pitch lives or dies on whether the agent earns back the time you spent describing the task. You’ll know within one ticket whether it does, for you, on this codebase.