For about three years the honest answer to “can a free open-weight model replace my GPT or Claude subscription?” was no, not really, not for the work you actually get paid for. That answer stopped being true sometime this spring, and June 2026 made it obvious.
On June 17, Z.ai dropped GLM-5.2 — a 753-billion-parameter model, MIT license, 1M-token context — and it landed at the top of the Artificial Analysis Intelligence Index among open weights. It beats GPT-5.5 on several long-horizon coding benchmarks at roughly a sixth of the cost. That’s not a budget alternative anymore. That’s the frontier, with the weights downloadable.
So the question isn’t “is open-weight good enough” — it is. The question is which one, for what, and whether you should rent it or run it. I’ve been moving real agent workloads onto these models for the past few weeks, and the picks are not interchangeable.
Why mid-2026 is the inflection point
Two things happened at once. The open-weight models got genuinely good at the hard stuff — long, multi-step agentic coding where a model has to plan, call tools, recover from its own mistakes, and not lose the thread over a 40-minute session. And the price floor cratered.
DeepSeek V4 in April reset what “cheap” means: V4-Flash runs around $0.14 in / $0.28 out per million tokens, and V4-Pro sits at roughly $1.74 / $3.48. When the Chinese labs ship a model, they ship it with pricing that makes the incumbents look like they’re charging rent.
Meanwhile the proprietary gap narrowed to something you can measure in single-digit percentages on most tasks. GLM-5.2 trails Claude Opus 4.8 by somewhere between one and thirteen points depending on the benchmark — and Opus 4.8 is the best coding model on the planet right now. Thirteen points behind the absolute best, while costing a fraction and running on your own hardware, is a trade a lot of teams will happily take.
If you’ve already read my DeepSeek V4 Pro review or the Gemma 4 local setup walkthrough, this is the roundup that puts the whole field side by side.
The contenders, quickly
Here’s the field worth caring about, and what each one is actually for.
GLM-5.2 (Z.ai) — The new open-weight intelligence leader. 753B parameters, MIT, 1M context. Architecture changes (sparse attention via what they call IndexShare, improved multi-token prediction, heavy agentic RL) are tuned specifically for long-horizon work. On Terminal-Bench 2.1 it jumped to 81.0 from GLM-5.1’s 62.0 — that’s not an incremental bump, that’s a different model.
DeepSeek V4 Pro / Flash — The price-performance king and the competitive-programming specialist. V4 bet everything on cheap, fast algorithmic reasoning. If your workload is “solve well-defined problems at scale for as little money as possible,” this is the default. MIT license.
Kimi K2.6 (Moonshot) — The stability pick. Where GLM wins on raw benchmark intelligence and DeepSeek wins on price, Kimi wins on not falling apart during long agent sessions: recoverable failure modes, consistent tool calling, dependable real-world software-engineering behavior. The newer K2.7 Code variant (June 13) is coding-specialized and cuts thinking tokens by about 30% versus K2.6, which matters when you’re paying per token and waiting on output.
Qwen 3.5 (Alibaba) — The safe, Apache-2.0, broadly-supported workhorse. The 397B reasoning variant scores competitively, and Qwen3-Coder-Next is a solid agentic-coding option. Qwen’s real advantage is ecosystem: it’s the most thoroughly supported open family across tooling, fine-tunes, and quantizations.
Llama 4 Maverick (Meta) — Reaches a 1M-token context window and remains the Western default for shops that want a name-brand, well-documented model with an enormous fine-tuning community. It’s no longer the benchmark leader, and the license is the catch (more below).
The honorable mentions — MiniMax M3 (released June 2026, strong on agentic coding) and Gemma 4 (Google’s small-but-mighty local option) round out the field. MiniMax M3 actually edges DeepSeek V4 Pro on some real-world agentic metrics, which tells you how crowded the top has gotten.
License reality check — read this before you ship
Benchmarks are fun. Licenses are what get you sued. The word “open” is doing a lot of unearned work across this field, so here’s the part nobody puts in the comparison table.
MIT (GLM-5.2, DeepSeek V4) — As permissive as it gets. Use it, modify it, ship it commercially, no revenue threshold, no usage report, no asterisk. Z.ai’s own docs make a point of saying the license guarantees “no regional limits.” For a startup that wants to embed a model in a product and never think about it again, MIT is the whole ballgame.
Apache 2.0 (Qwen 3.5) — Effectively as free as MIT for commercial purposes, with an explicit patent grant that some corporate legal teams specifically want to see. If your lawyers care about patent retaliation clauses, Qwen is the easy yes.
Llama community license (Llama 4) — Not actually open source, whatever the marketing says. There’s an acceptable-use policy, and the big one: if your product crosses 700 million monthly active users you have to negotiate a separate license with Meta. Almost nobody hits that. But “almost nobody” is a different promise than MIT’s “nobody, ever,” and it’s why a lot of teams quietly moved off Llama once the Chinese MIT models caught up on quality.
The practical takeaway: if license cleanliness matters at all — and for anything you’re shipping commercially, it should — GLM-5.2 and DeepSeek give you frontier-ish quality with zero strings. That combination didn’t exist a year ago.
Benchmarks that actually separate them
I’ll skip the leaderboard cosplay and focus on the numbers that predict whether a model will survive your real workload.
On agentic coding — the long-horizon stuff — GLM-5.2 leads the open field on GDPval-AA v2 (a real-world agentic metric) at 1524, ahead of MiniMax M3 at 1418 and DeepSeek V4 Pro at 1328, and effectively level with GPT-5.5’s 1514. On SWE-bench Pro it posts 62.1. These are the scores that matter if you’re building a coding agent, because they measure sustained tool use, not one-shot trivia.
On raw coding benchmarks, BenchLM’s Chinese-model leaderboard has DeepSeek V4 Pro (Max) on top at 87, then GLM-5.1 at 83, Kimi K2.6 and GLM-5 Reasoning at 81, and Qwen3.5 397B at 79. Note that’s GLM-5.1 in that particular ranking — the 5.2 numbers landed after, and they move it up substantially.
On long context, both GLM-5.2 and Llama 4 Maverick hit 1M tokens. In my experience the headline context number and the usable context number are different animals — most of these models degrade well before their advertised ceiling — but for genuinely large codebases or document sets, that 1M window is the difference between “fits” and “doesn’t.”
The honest summary: GLM-5.2 is the best all-rounder, DeepSeek wins tight algorithmic problems on price, Kimi is the most reliable over long sessions, and Qwen is the one with the fewest surprises in production. None of them is best at everything, which is exactly why the “pick by use case” framing beats any single ranking.
Run it yourself vs rent it
This is where most “best open model” posts wave their hands, so let’s be specific.
Renting (managed inference) is what 90% of you should do, at least to start. GLM-5.2, DeepSeek V4, Kimi, and Qwen are all available across the usual providers — Fireworks, Together, OpenRouter, Baseten, DeepInfra, Groq, and the hyperscaler catalogs. You get an API key, you point your agent at it, you pay per token. No GPUs, no ops. I covered the provider landscape in detail in the AI inference providers guide, and the short version is that for open weights, the provider you pick changes your latency and price more than people expect — the same model can vary 2-3x in throughput across hosts.
Self-hosting is a real option now, but be clear-eyed about the bar. A model like GLM-5.2 wants 4 to 8 H100-class GPUs to serve at full context with acceptable throughput. That’s a serious capital or rental commitment — call it several thousand dollars a month on a cloud like RunPod, Lambda, or your hyperscaler of choice, before you’ve served a single external user. Smaller models (Gemma 4, quantized Qwen variants) run on a single high-end GPU or even a beefy workstation, which is why they remain the right entry point for local experimentation.
One warning from the trenches: a model being downloadable on day one doesn’t mean the serving stack is production-grade on day one. Plan for an extra few weeks of maturation — inference kernels, quantizations, and framework support all lag the weight release. If you need GLM-5.2 self-hosted and stable this week, you’ll fight more than you want to.
What it actually costs
Let’s put numbers on the decision, because “open weight is cheaper” is true but useless without specifics.
Say you’re running a coding agent that burns 50 million tokens a month — a reasonable figure for one developer leaning on it hard, split maybe 30M in / 20M out.
- DeepSeek V4-Flash via API: roughly $4 in + $6 out = about $10/month. That is not a typo.
- DeepSeek V4-Pro: around $52 in + $70 out = roughly $122/month.
- GLM-5.2 via API: standalone token pricing wasn’t fully published at launch, but it’s tracking the $1-2 in / $3-6 out range based on GLM-5.1 — call it $90-150/month at this volume. Z.ai also sells a Coding Plan starting around $18/month that bundles GLM-5.2 access for coding workflows, which undercuts almost everything if your usage fits inside it.
- A frontier proprietary subscription (GPT-5.5 or Claude at comparable usage): generally $200+/month once you’re working at agent scale, and more if you blow past plan caps into API overages.
Self-hosting only beats managed API economics at high, steady volume — the breakeven is somewhere north of a few hundred million tokens a month, where the fixed GPU cost amortizes against per-token savings. Below that, you’re paying for idle silicon. Above it — and for anyone with hard data-residency or privacy requirements — owning the boxes starts to make sense, and that’s often the real reason teams self-host: not cost, but never sending a token off-premises.
Which one should you actually pick
Here’s how I’d choose, by what you’re doing.
Building a coding agent and want the best open option → GLM-5.2. It tops the open-weight agentic benchmarks, the MIT license means you ship without lawyers, and it’s close enough to Opus 4.8 that the gap won’t be your bottleneck.
Cost is the dominant constraint, work is well-defined → DeepSeek V4-Flash for volume, V4-Pro when you need more reasoning. Nothing else competes on price per useful token, and the MIT license is just as clean.
Long, autonomous agent sessions where reliability beats peak IQ → Kimi K2.6, or K2.7 Code if your workload is coding-heavy and you want fewer thinking tokens. It’s the one I trust to not derail at minute 35.
On-prem, privacy-critical, or you want the most battle-tested ecosystem → Qwen 3.5 (Apache 2.0). The broadest tooling support and the fewest production surprises.
You want a Western name brand and a huge fine-tune community → Llama 4 Maverick, with eyes open about the community license.
Just experimenting locally on one GPU → Gemma 4 or a quantized Qwen. Start cheap, graduate to the big models once you know the workload.
The thing to internalize is that “free model vs paid model” is the wrong frame now. The real choice is which open model fits your shape of work — and for a growing share of teams, the paid subscription is the line item that’s getting cut, not the open weights.
If you’ve been running everything through a single proprietary API out of habit, pick one workload this week — your cheapest, highest-volume one — and point it at DeepSeek V4-Flash or a GLM-5.2 endpoint. Compare the bill at the end of the month. That number tends to end the debate faster than any benchmark.
Sources: VentureBeat on GLM-5.2 vs GPT-5.5, Artificial Analysis: GLM-5.2 leads open weights, BenchLM Chinese LLM leaderboard, Kilo open-source coding models, DeepSeek API pricing. Benchmarks and pricing as of June 2026 — check the official model cards and provider pages for current figures.