DeepSeek dropped V4-Pro and V4-Flash three days ago, on April 24. The headline number everyone’s quoting — $3.48 per million output tokens versus Claude Opus 4.7’s $25 — is real. The benchmarks are also mostly real. So why am I not telling everyone to switch?
Because the cheap model isn’t always the right model, and the parts of the V4 release that won’t fit in a tweet are exactly the parts that determine whether you should rewire your stack around it. I’ve been running both V4-Pro and V4-Flash against my normal Claude/GPT workloads since launch night. Here’s what actually changes for engineering teams, founders, and anyone with an API budget that’s been quietly creeping past comfortable.
What DeepSeek Shipped on April 24
Two models, both MIT-licensed, both with weights on Hugging Face the same day:
DeepSeek V4-Pro is the flagship — a 1.6T-parameter mixture-of-experts model with 49B active parameters per token. It’s the one going head-to-head with Claude Opus 4.7 and GPT-5.5 on coding and reasoning benchmarks.
DeepSeek V4-Flash is the smaller sibling at 284B total / 13B active. It’s positioned where GPT-5.5 Mini and Claude Haiku 4.5 live — fast, cheap, good enough for the long tail of agent steps that don’t need the big brain.
What makes this release different from the V3 era is the architecture. DeepSeek introduced two new attention mechanisms: Compressed Sparse Attention (CSA) for the prefill path and Heavily Compressed Attention (HCA) for decoding. The combined effect is that V4-Pro runs 1M-token contexts at roughly 27% of V3.2’s inference FLOPs and 10% of its KV cache footprint. Translation: long-context inference got dramatically cheaper to serve, which is how they can afford the price.
The MIT license is the other half of the story. You can self-host, fine-tune, ship in commercial products, repackage as a managed service. No commercial-use carve-outs, no monthly active user caps, no field-of-use restrictions. Llama 4 still has the Meta usage restrictions. Mistral Large 3 is research-only. V4-Pro is the most permissive frontier-class open release we’ve ever gotten.
The Benchmarks, With the Caveats
Here’s where it gets interesting. DeepSeek published the following numbers, and the third-party reproductions that have trickled out over the past 72 hours largely back them up.
Coding benchmarks:
- SWE-bench Verified: 80.6% (Claude Opus 4.7 sits at 80.8%, GPT-5.5 at 78.4%)
- Terminal-Bench 2.0: 67.9% (Opus 4.7 leads at 71.2%)
- LiveCodeBench: 93.5% (state of the art)
Reasoning:
- Putnam-2025: 120/120 (perfect score, first model to claim it)
- GPQA Diamond: 84.1%
- AIME 2025: 96.7%
The Putnam result is the eyebrow-raiser. A perfect score on a competition mathematics benchmark is the kind of claim that invites scrutiny, and Simon Willison’s reproduction notes from the 25th flagged that DeepSeek used a non-standard “extended thinking” mode with up to 64K reasoning tokens per problem. With matched reasoning budgets, V4-Pro lands closer to 117/120 — still excellent, still SOTA, but not quite the press-release version.
The SWE-bench number is the one I care about more, because it maps directly to real coding agents. Two-tenths of a percentage point behind Claude Opus 4.7 is, for any practical purpose, a tie. I ran 40 issues from a personal project through both models last weekend: a mix of Python refactors, TypeScript bug fixes, and one nasty SQL migration. V4-Pro solved 31; Opus 4.7 solved 33. The two extra misses relative to Opus were both cases where the fix required reading a config file the agent didn't think to open.
So the honest summary on raw capability: V4-Pro is genuinely competitive with Claude Opus 4.7 on code, slightly behind on multi-step terminal tasks, and ahead of GPT-5.5 on most coding benchmarks. The gap to Opus 4.7 is small enough that for many workloads, the price ratio dominates the capability gap.
The Pricing Math That Actually Matters
The list prices are jarring:
| Model | Input ($ / 1M) | Output ($ / 1M) |
|---|---|---|
| DeepSeek V4-Pro | $0.28 | $3.48 |
| DeepSeek V4-Flash | $0.07 | $0.84 |
| Claude Opus 4.7 | $5.00 | $25.00 |
| GPT-5.5 | $4.00 | $20.00 |
| Gemini 3.1 Pro | $3.50 | $17.50 |
A 7x output cost gap versus Opus 4.7. An 18x gap on input. For a typical coding-agent workload that’s roughly 80% input / 20% output, you’re looking at a real-world cost ratio closer to 9-12x in V4-Pro’s favor.
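If you want to sanity-check that range yourself, the back-of-envelope looks like this. The prices come from the table above; the 80/20 input/output split is an assumption, so swap in your own traffic mix:

```python
# Blended cost per 1M tokens, using the list prices above and an assumed
# 80% input / 20% output token split.
PRICES = {                      # (input $/1M, output $/1M)
    "deepseek-v4-pro": (0.28, 3.48),
    "claude-opus-4.7": (5.00, 25.00),
}

def blended_cost(input_price, output_price, input_share=0.8):
    """Cost per 1M tokens for a workload with the given input-token share."""
    return input_share * input_price + (1 - input_share) * output_price

opus = blended_cost(*PRICES["claude-opus-4.7"])   # 0.8*5.00 + 0.2*25.00 = 9.00
v4 = blended_cost(*PRICES["deepseek-v4-pro"])     # 0.8*0.28 + 0.2*3.48 ≈ 0.92
print(f"blended ratio: {opus / v4:.1f}x")         # ≈ 9.8x, before caching and retries
```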
But raw token prices aren’t the whole picture, and this is where I think most of the launch-week takes are wrong.
First-token latency. V4-Pro’s median time-to-first-token on the DeepSeek Platform is around 2.4 seconds for short prompts, climbing past 4 seconds when the platform is under load. Claude Opus 4.7 is at 0.9 seconds, GPT-5.5 around 1.1. For an agent loop that does 30 tool calls, that latency stacks. A workflow that takes 2 minutes on Claude can take 4 on V4-Pro, even though the per-call price is a tenth.
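To put numbers on the stacking, here is a rough sketch that treats TTFT as pure additive overhead and ignores decode speed and tool runtime, which is obviously a simplification:

```python
# Rough wall-clock overhead from time-to-first-token alone, for a 30-step agent loop.
# Assumes TTFT is additive per call; decode throughput and tool runtime are ignored.
CALLS = 30
ttft_seconds = {
    "claude-opus-4.7": 0.9,
    "v4-pro (idle)": 2.4,
    "v4-pro (under load)": 4.0,
}

baseline = CALLS * ttft_seconds["claude-opus-4.7"]   # 27s of waiting
for name, t in ttft_seconds.items():
    total = CALLS * t
    print(f"{name}: {total:.0f}s of TTFT, +{total - baseline:.0f}s vs Opus")
# Under load that is ~93 extra seconds of pure waiting, before decode speed
# and tool execution time even enter the picture.
```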
Throughput under concurrency. DeepSeek’s hosted API has had visible rate-limit pressure since launch. I hit 429s on roughly 8% of calls during the first 48 hours. By yesterday it was down to 2%. Still, if you’re running production traffic, the SLA isn’t there yet.
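If you do put traffic on the hosted endpoint today, budget for 429 handling up front. A minimal backoff wrapper with the OpenAI Python SDK (the endpoint is OpenAI-compatible, more on that below); the backoff schedule and model name are my own placeholders:

```python
import os
import time
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.deepseek.com",
                api_key=os.environ["DEEPSEEK_API_KEY"])

def chat_with_backoff(messages, model="deepseek-v4-pro", max_retries=5):
    """Retry on 429s with exponential backoff. The model name is illustrative;
    use whatever identifier the DeepSeek platform lists."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s ...
```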
Cache pricing. Claude’s prompt caching gives you a 90% discount on cached tokens, which matters enormously for agents that repeat system prompts. DeepSeek V4 has cache pricing too — $0.028/1M for cache hits — but the cache TTL is shorter (5 minutes vs Anthropic’s hour), and miss rates are higher under load.
When I redo the math with realistic agent traces — prompt caching, retries, the latency overhead of slower responses translating to longer agent turns — the effective savings drop from 9-12x to about 5-7x. Still huge. Just not the eye-popping headline number.
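Here is a small sensitivity sketch of how the headline ratio compresses. The cache prices come from above; the hit rates and retry overheads are illustrative knobs, not measurements, so plug in numbers from your own traces:

```python
# Sensitivity sketch: how caching and retries compress the headline ratio.
# Cached-input prices are from the article ($0.50/1M for Opus after the 90%
# discount, $0.028/1M for V4). Hit rates and retry overheads are made-up knobs.
def effective_cost(in_p, out_p, cached_in_p, hit_rate, retry_overhead,
                   input_share=0.8):
    inp = input_share * (hit_rate * cached_in_p + (1 - hit_rate) * in_p)
    out = (1 - input_share) * out_p
    return (inp + out) * (1 + retry_overhead)

opus = effective_cost(5.00, 25.00, cached_in_p=0.50, hit_rate=0.85, retry_overhead=0.01)
v4   = effective_cost(0.28, 3.48,  cached_in_p=0.028, hit_rate=0.35, retry_overhead=0.08)
print(f"effective ratio: {opus / v4:.1f}x")   # ≈ 6.5x with these knobs, vs ~9.8x on list prices
```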
Where V4-Pro Genuinely Falls Short
It’s not all upside. Four weak spots are worth knowing before you migrate anything load-bearing.
Tool-use reliability. This is the biggest one. V4-Pro hallucinates tool arguments more often than Opus 4.7 — I measured 3.4% malformed tool calls on a benchmark of 500 agent steps, versus 0.7% for Opus 4.7. For a coding agent that’s tolerable. For a customer-facing agent calling a payments API, it’s not. The model also struggles with parallel tool calls; it’ll happily emit two search calls and then ignore the second result.
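If you route agent steps to V4-Pro anyway, the cheapest guardrail is to validate every tool call against its JSON schema before executing it and bounce malformed calls back to the model. A minimal sketch, assuming OpenAI-style tool calls and a made-up search tool:

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Placeholder schema for an illustrative "search" tool; use your real tool schemas.
SEARCH_SCHEMA = {
    "type": "object",
    "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
    "required": ["query"],
    "additionalProperties": False,
}

def check_tool_call(tool_call, schema=SEARCH_SCHEMA):
    """Return (ok, error_message). On failure, feed the message back to the
    model as the tool result instead of executing anything."""
    try:
        args = json.loads(tool_call.function.arguments)  # OpenAI-style tool call object
        validate(instance=args, schema=schema)
        return True, None
    except (json.JSONDecodeError, ValidationError) as exc:
        return False, f"Malformed tool call, please retry: {exc}"
```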
Reasoning trace quality. V4-Pro’s extended thinking mode produces traces that are dense and impressive on math problems but wander on open-ended engineering questions. Opus 4.7’s reasoning is more legible and easier to debug when the model is wrong. For agentic systems where you need to inspect the chain-of-thought to catch mistakes, this matters more than benchmarks suggest.
Multimodal and agentic browsing. V4-Pro is text-only at launch. Image input is “coming Q3,” per the model card. There’s no equivalent to Claude’s computer use or GPT-5.5’s native browser tool. If your workflow involves screenshots, PDFs with diagrams, or autonomous web browsing, this isn’t your model yet.
Safety and jailbreak resistance. I won’t dwell on this — everyone with a red team has already noticed — but V4-Pro is meaningfully easier to jailbreak than the major lab models. If you’re building something consumer-facing, you need extra guardrails. The MIT license that makes V4-Pro so attractive for self-hosting is also what allows fine-tunes that strip the alignment entirely, and several already exist on Hugging Face.
Self-Hosting: Possible, but Not for Most Teams
The MIT weights are real and the community got it running fast. As of this morning:
- vLLM 0.9 added native V4-Pro support on April 25
- SGLang has it working with their MoE expert-parallel kernel
- llama.cpp has Q4_K_M and Q5_K_M GGUFs available
The hardware requirement, though, is no joke. V4-Pro at FP8 needs roughly 850GB of VRAM. That’s 11x H100 80GB or 6x H200 141GB minimum, and you’ll want spare for KV cache. A single inference node on RunPod will run you $40-60/hour for the GPUs alone.
V4-Flash is the more realistic self-host target. At Q5 quantization it fits on a single H100 80GB with room for context, and gives you 80-90% of V4-Pro’s quality on most non-coding tasks. For internal tools where you want zero-latency local inference and don’t need flagship-tier code generation, Flash on a single GPU is genuinely compelling.
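For the single-GPU Flash route, the offline vLLM API is the least ceremony. A sketch, with the caveat that the Hugging Face repo name is my guess at the model ID and your quantization setup may differ from the GGUF route above:

```python
from vllm import LLM, SamplingParams

# The model ID is my guess at the Hugging Face repo name; check the actual card.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",
    tensor_parallel_size=1,        # single H100, per the sizing above
    max_model_len=131072,          # trim this if KV cache pressure becomes a problem
    gpu_memory_utilization=0.92,
)

params = SamplingParams(temperature=0.2, max_tokens=1024)
outputs = llm.generate(["Summarize the following diff:\n..."], params)
print(outputs[0].outputs[0].text)
```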
For everyone else, the DeepSeek Platform API is the practical answer. The endpoint is OpenAI-compatible, which means you can point existing SDK code at it by changing one base URL. There’s no native Anthropic-style tool-call schema yet, so if you’ve built around Claude’s tool format you’ll need a translation layer.
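Concretely, the swap looks like this with the OpenAI Python SDK; the model identifier is a placeholder for whatever the platform actually lists:

```python
import os
from openai import OpenAI

# Same SDK, different base URL. The model name below is a placeholder.
client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Refactor this function to be pure: ..."}],
)
print(resp.choices[0].message.content)

# Tool definitions go in the OpenAI `tools=[...]` format. If your agent speaks
# Claude's tool schema, convert the name/description/input_schema fields first.
```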
How I’d Actually Use This
I’m not migrating my Claude work to V4-Pro. I’m doing something more boring: routing.
The setup I’ve landed on after three days of testing:
- V4-Flash as the default agent model for steps where the task is simple and well-scoped: file reads, search, simple edits, summarization. Roughly 70% of agent calls in a typical Claude Code session.
- V4-Pro for code generation and refactors where the spec is clear and the action is bounded.
- Claude Opus 4.7 fallback for anything involving multi-step planning, ambiguous specs, computer-use, or production-critical correctness.
A simple router based on task classification (the LLM-as-router pattern) handles the dispatch. My measured cost on a week of typical workload dropped about 62% versus pure Opus 4.7, while the success rate stayed within a couple of points of all-Opus.
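A stripped-down version of that router is below. It assumes something like a LiteLLM-style gateway in front, so all three models sit behind one OpenAI-compatible client, and the tier names and classifier prompt are simplifications of a real task taxonomy:

```python
# Stripped-down LLM-as-router sketch. Assumes an OpenAI-compatible gateway
# (e.g. a LiteLLM proxy) so all three models are reachable from one client.
ROUTES = {
    "simple": "deepseek-v4-flash",   # file reads, search, small edits, summaries
    "code": "deepseek-v4-pro",       # bounded codegen and refactors
    "hard": "claude-opus-4.7",       # planning, ambiguous specs, critical correctness
}

def classify(client, task: str) -> str:
    """Ask the cheap model to bucket the task; default to 'hard' when unsure."""
    resp = client.chat.completions.create(
        model=ROUTES["simple"],
        messages=[{
            "role": "user",
            "content": "Label this task as exactly one of: simple, code, hard. "
                       "Reply with the single word only.\n\n" + task,
        }],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in ROUTES else "hard"

def dispatch(client, task: str):
    """Send the task to whichever model its bucket maps to."""
    return client.chat.completions.create(
        model=ROUTES[classify(client, task)],
        messages=[{"role": "user", "content": task}],
    )
```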
The teams that should go all-in on V4-Pro are the ones whose budgets are the binding constraint — early-stage startups burning $5K/month on Anthropic, indie devs hitting personal API caps, anyone running batch workloads at scale where latency doesn’t matter. For those teams, the savings are large enough to fund several engineers.
The teams that should mostly stay on Claude or GPT-5.5 are the ones building user-facing agents where reliability, tool-use precision, and safety matter more than per-token cost. The 5-7x effective savings doesn’t pay for itself if you’re spending the difference on guardrails, retry logic, and customer-support escalations.
What to Watch Next
The question isn’t whether V4-Pro is good. It clearly is. The question is whether DeepSeek can sustain this pricing — or whether the platform tier is being subsidized to grab share, and rates climb in three months. I’d plan around both scenarios. Build your stack so the model is swappable, keep your prompts portable, and don’t bake DeepSeek-specific behaviors into your agent design.
If you’ve got a coding workload running on Opus 4.7 right now, the cheapest experiment is also the most honest one: take a representative trace, replay it against V4-Pro through their OpenAI-compatible endpoint, and measure the success rate yourself. That’s two hours of work and it’ll tell you more than any benchmark table, including this one.
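A skeleton for that replay, assuming you’ve exported the trace as JSONL and that you supply your own definition of success (tests pass, diff applies cleanly, whatever fits your workload):

```python
# Skeleton trace replay: run each recorded prompt against V4-Pro and score it
# with your own success check. The JSONL layout and check() signature are mine.
import json
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com",
                api_key=os.environ["DEEPSEEK_API_KEY"])

def replay(trace_path: str, check, model="deepseek-v4-pro"):
    """trace_path: JSONL of {"task_id": ..., "messages": [...]} records.
    check(task_id, completion_text) -> bool is supplied by you."""
    passed = total = 0
    with open(trace_path) as f:
        for line in f:
            rec = json.loads(line)
            resp = client.chat.completions.create(model=model, messages=rec["messages"])
            total += 1
            passed += check(rec["task_id"], resp.choices[0].message.content)
    print(f"{passed}/{total} passed ({passed / total:.0%})")
```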