Microsoft MAI-Thinking-1 and MAI-Code-1-Flash at Build 2026

June 4, 2026

Microsoft just showed up to its own party with models it built itself. At Build 2026 on June 2, the company rolled out seven in-house MAI models — a reasoning model, two coding models, image, voice, and transcription — all trained from scratch, none of them riding on OpenAI’s coattails. For a company whose entire AI story for three years was “we have a big stake in OpenAI,” that’s a real shift.

The two that matter most if you write code or make decisions about AI infrastructure are MAI-Thinking-1, Microsoft’s first large reasoning model, and MAI-Code-1-Flash, a tiny coding model that’s already landing in GitHub Copilot. The benchmark claims are aggressive. The strategy behind them is more interesting than the benchmarks.

Let me walk through what’s real, what you can touch today, and whether any of it should change your setup.

The short version of why this happened

Back in October 2025, Microsoft and OpenAI restructured their relationship. The exclusive, all-in dependency turned into something analysts keep calling “coopetition” — Microsoft still has access to OpenAI’s models, but it’s no longer betting the whole house on them. Mustafa Suleyman, who runs Microsoft AI, has been openly talking about self-sufficiency since early this year.

Build 2026 is where that talk turned into shipped weights. Seven models, the broadest single release in the company’s history, and the framing was unmistakable: Microsoft wants optionality. It wants to be able to serve Copilot, Office, and Azure customers on its own silicon-friendly models when that’s cheaper, and reach for OpenAI or Anthropic when it’s not.

That’s the lens to read every benchmark claim through. These models aren’t trying to be the smartest thing alive. They’re trying to be good enough at a fraction of the cost — and Microsoft controls where they run.

MAI-Thinking-1: a mid-sized reasoner that punches up

MAI-Thinking-1 is the headliner. It’s a sparse Mixture-of-Experts model with 35 billion active parameters and a 256K-token context window. The detail Microsoft keeps repeating is that it was trained entirely on commercially licensed data, with no distillation from any third-party model. No quietly learning from GPT outputs. That’s partly a legal-cleanliness flex and partly a jab at how a lot of “independent” models actually get trained.

The numbers Microsoft published:

  • AIME 2025: 97.0% and AIME 2026: 94.5% — these are competition-math benchmarks that test multi-step reasoning, and those scores are frontier-tier.
  • SWE-Bench Pro: Microsoft says it matches Claude Opus 4.6 on coding tasks.
  • In blind side-by-side evaluations run by Surge (Microsoft’s independent human-rating partner), raters preferred MAI-Thinking-1 over Claude Sonnet 4.6.

Read those carefully, because the framing is doing work. “Matches Opus 4.6 on SWE-Bench Pro” and “preferred over Sonnet 4.6 in blind tests” are two different comparison points against two different Claude tiers. A 35B-active MoE matching Opus-class coding while being cheap to run would genuinely matter. I’d want to see independent reproductions before I treat it as settled — vendor benchmarks have a way of choosing the kindest cuts — but the AIME scores alone tell you this isn’t a toy.

The real pitch is buried in the size. Having only 35B active parameters means it’s cheap per token compared to the frontier giants, and Microsoft hammered the efficiency angle all keynote. One MAI variant tuned for Excel reportedly matches GPT-5.4 while running up to ten times more efficiently; a version tuned to McKinsey’s enterprise standards reportedly came in at roughly one-tenth the cost. Whether or not those specific numbers hold up, the direction is clear: Microsoft is optimizing for cost-per-useful-output, not leaderboard bragging rights.
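To put a rough number on the size argument, here is a back-of-the-envelope sketch. It leans on the common rule of thumb that a transformer forward pass costs about 2 FLOPs per active parameter per token. The 35B figure is Microsoft’s; the 350B dense model is a made-up reference point I picked for illustration, not any real model’s size, and the rule ignores attention cost, memory bandwidth, and expert-routing overhead.

```python
# Rule of thumb (an approximation, not a measurement):
# forward-pass compute is about 2 FLOPs per active parameter per token.
def forward_flops_per_token(active_params: float) -> float:
    return 2.0 * active_params


MAI_THINKING_ACTIVE = 35e9   # active parameters, per Microsoft's announcement
HYPOTHETICAL_DENSE = 350e9   # assumption for illustration only, not a real model

mai_flops = forward_flops_per_token(MAI_THINKING_ACTIVE)
dense_flops = forward_flops_per_token(HYPOTHETICAL_DENSE)
ratio = dense_flops / mai_flops

print(f"MAI-Thinking-1 estimate: {mai_flops:.1e} FLOPs per token")
print(f"Hypothetical dense model needs about {ratio:.0f}x the compute per token")
```

Compute per token is only one input to price, so treat this as an intuition pump rather than a cost forecast.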

MAI-Code-1-Flash: the 5B model that’s already in Copilot

This is the one most developers will actually run first, because it’s not in a private preview — it’s shipping into GitHub Copilot.

MAI-Code-1-Flash is a 5-billion-parameter coding model. Five billion. That’s small enough to be genuinely cheap to serve at scale, which is the entire point. And the benchmark claim is the kind that makes you double-check the footnotes:

  • SWE-Bench Pro: 51.2% vs Claude Haiku 4.5’s 35.2% — a 16-point lead.
  • It solves harder problems with up to 60% fewer tokens on SWE-Bench Verified versus Haiku.
  • Microsoft says it beats Haiku 4.5 across all four coding benchmarks tested, plus instruction-following margins ranging from +14.5 to +28.9 points.
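It helps to separate the absolute lead from the relative one, since headlines tend to blur them. A quick sketch using only the two SWE-Bench Pro pass rates Microsoft published (the variable names are mine):

```python
# SWE-Bench Pro pass rates (%) as published by Microsoft; not independently verified.
flash_pass = 51.2
haiku_pass = 35.2

absolute_lead = round(flash_pass - haiku_pass, 1)                       # percentage points
relative_lead = round((flash_pass - haiku_pass) / haiku_pass * 100, 1)  # % improvement over Haiku
still_unsolved = round(100 - flash_pass, 1)                             # % of tasks Flash does not solve

print(f"Absolute lead: {absolute_lead} points")
print(f"Relative lead: {relative_lead}%")
print(f"Tasks still unsolved: {still_unsolved}%")
```

The 16-point gap works out to roughly a 45% relative improvement, while nearly half of the benchmark’s tasks remain unsolved. Both readings are worth keeping in mind.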

A 5B model beating Haiku 4.5 by 16 points on SWE-Bench Pro is a bold claim, and the comparison is pointed — Haiku is Anthropic’s small-and-fast tier, the exact slot Microsoft wants to fill in its own stack. If it holds, this is the model Copilot reaches for on autocomplete, quick edits, and the thousand small tasks that don’t need a frontier brain.

Here’s the catch I’d flag: small models are great at well-scoped, common-pattern tasks and fall apart on the gnarly, sprawling, “this touches eleven files and a race condition” problems. SWE-Bench Pro is harder than vanilla SWE-Bench, so 51% is respectable, but it still means the model fails roughly half of the benchmark’s tasks. Treat MAI-Code-1-Flash as a fast assistant for the routine 70% of your work, not the model you hand your hairiest refactor to.

It’s rolling out to GitHub Copilot individual users in VS Code, both in the model picker and under the default auto picker — so it may start handling your requests without you choosing it. There’s also MAI-Code-1, the full-size sibling tuned specifically for GitHub workflows, available now in Copilot and VS Code.

The other four, quickly

The lineup wasn’t just reasoning and code. Rounding out the seven:

  • MAI-Image-2.5 and a Flash variant — debuted at #3 on the Arena.ai text-to-image leaderboard, and #2 on image-to-image, where Microsoft says it surpassed Nano Banana 2.
  • MAI-Transcribe-1.5 — 43 languages, top spot on the FLEURS speech benchmark, with streaming coming.
  • MAI-Voice-2 and its Flash variant — voice cloning and prompting across 15-plus additional languages.

None of these dethrone the category leaders, but that’s not the goal. Microsoft now has a full house — text, code, image, voice, speech — it can run end to end without paying another lab. That’s the strategic story the individual benchmarks miss.

Why “fewer tokens” is the claim that actually matters

Buried under the pass-rate headlines is the number that hits your bill: token efficiency. MAI-Code-1-Flash doesn’t just claim a higher SWE-Bench score than Haiku 4.5 — it claims to reach harder solutions using up to 60% fewer tokens. That second part is the one I’d care about if I were running an agentic coding loop at any volume.

Here’s why. When you run a coding agent, it doesn’t make one call — it makes dozens. Read a file, reason, propose an edit, run a test, read the error, try again. Every loop burns input and output tokens, and the bill scales with how many tokens the model chews through to get to a working answer. A model that’s slightly less accurate but dramatically more token-frugal can come out cheaper and faster on real workflows, because it spends fewer round-trips flailing. That’s the whole reason a 5B model can be commercially interesting against a model many times its size.
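A minimal sketch of that arithmetic, under loud assumptions: every price, step count, and token count below is hypothetical (Microsoft has published no MAI pricing), and I model “fewer tokens” as fewer loop steps at the same per-step size. Real agents also accumulate context, so input tokens per step usually grow over a run; this ignores that.

```python
from dataclasses import dataclass


@dataclass
class LoopProfile:
    steps: int                   # model calls needed to finish one task
    input_tokens_per_step: int   # simplification: constant per step
    output_tokens_per_step: int
    usd_per_m_input: float       # hypothetical price per million input tokens
    usd_per_m_output: float      # hypothetical price per million output tokens


def cost_per_task(p: LoopProfile) -> float:
    """Total dollar cost of one agent run under the profile's assumptions."""
    tokens_in = p.steps * p.input_tokens_per_step
    tokens_out = p.steps * p.output_tokens_per_step
    return (tokens_in * p.usd_per_m_input + tokens_out * p.usd_per_m_output) / 1_000_000


# Both profiles use the same made-up prices so only token volume differs.
baseline = LoopProfile(steps=30, input_tokens_per_step=8_000,
                       output_tokens_per_step=1_000,
                       usd_per_m_input=1.0, usd_per_m_output=5.0)
frugal = LoopProfile(steps=12, input_tokens_per_step=8_000,
                     output_tokens_per_step=1_000,
                     usd_per_m_input=1.0, usd_per_m_output=5.0)

print(f"baseline: ${cost_per_task(baseline):.3f} per task")
print(f"frugal:   ${cost_per_task(frugal):.3f} per task")
```

With identical prices, cutting token volume by 60% cuts the per-task bill by the same 60%. If the smaller model is also cheaper per token, the two savings multiply.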

The flip side is the part Microsoft won’t put on a slide: token efficiency on a benchmark is measured on tasks the model can actually solve. On the tasks it can’t, a small model can burn just as many tokens looping uselessly before it gives up or hands you something broken. So the efficiency win is real on the routine work and evaporates on the hard stuff — which is exactly the work distribution a “Flash” tier model is built for. Match the tool to the task and the savings are genuine. Reach for it on the wrong task and you pay in re-prompts and debugging time, which never show up in a token count.
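Here is one way to see that effect in numbers. The sketch assumes you retry a failed task with the same model until it succeeds, that attempts are independent, and that a failed attempt burns a fixed number of tokens. All the probabilities and token counts are invented for illustration; none come from Microsoft.

```python
def expected_tokens_per_solved_task(p_solve: float,
                                    tokens_on_success: float,
                                    tokens_on_failure: float) -> float:
    """Expected tokens spent to get one working answer.

    With independent attempts, the expected number of failures before the
    first success is (1 / p_solve) - 1.
    """
    if not 0 < p_solve <= 1:
        raise ValueError("p_solve must be in (0, 1]")
    expected_failures = 1 / p_solve - 1
    return tokens_on_success + expected_failures * tokens_on_failure


# Hypothetical numbers: same model, two kinds of work.
routine = expected_tokens_per_solved_task(0.9, 40_000, 100_000)
hard = expected_tokens_per_solved_task(0.2, 40_000, 100_000)

print(f"routine work: about {routine:,.0f} tokens per solved task")
print(f"hard work:    about {hard:,.0f} tokens per solved task")
```

Under these made-up inputs the hard tasks cost several times more tokens per working answer than the routine ones, before counting your own debugging time.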

Where you can actually use them today

This is where the hype meets reality, and the two flagship models land in very different places.

MAI-Code-1-Flash and MAI-Code-1 are the accessible ones. If you’re a GitHub Copilot individual user in VS Code, they’re showing up in your model picker now as part of a progressive rollout. No setup, no waitlist. If you don’t see them yet, give it a few days.

MAI-Thinking-1 is on Microsoft Foundry, but only in private preview: you request access and wait. It’s not something you can wire into production this week.

Microsoft also said the MAI models will be available on Fireworks AI, Baseten, and OpenRouter, and that Fireworks AI is now generally available on Foundry. That last part is quietly the most useful announcement for a lot of teams: a single platform with enterprise governance and Azure data residency, regardless of which model you pick. If you’ve been juggling model access across three providers, consolidating routing matters more than any one model’s AIME score.

One thing Microsoft did not disclose: pricing. No per-token numbers for any MAI model. Given that the entire pitch is cost efficiency, that’s a strange omission, and it means you can’t do a real total-cost comparison against Claude or GPT yet. “Cheaper” is a claim until there’s a price sheet.

So should you switch anything?

Honestly, for most people, not yet — but there’s one easy experiment worth running.

If you use Copilot in VS Code, try MAI-Code-1-Flash on your everyday work for a week. It costs you nothing to select it, and the failure mode is mild: if it whiffs on a task, you re-prompt with a bigger model. The thing to watch is whether the speed and lower token cost actually translate into a smoother loop, or whether you spend the savings re-asking. My bet is it’ll be genuinely good for boilerplate, scaffolding, and small edits, and inconsistent on anything that needs real cross-file reasoning. Test it against that hypothesis.
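If you want the week-long test to produce more than a vibe, keep a tiny log. This is a sketch of one way to tally it; the task categories and the sample entries are placeholders I made up, and nothing here talks to Copilot. You record each task by hand.

```python
from collections import defaultdict


def first_try_rate(log):
    """Share of tasks accepted without a re-prompt, grouped by task kind.

    `log` is an iterable of (kind, accepted_first_try) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # kind -> [accepted, total]
    for kind, accepted in log:
        totals[kind][1] += 1
        if accepted:
            totals[kind][0] += 1
    return {kind: ok / n for kind, (ok, n) in totals.items()}


# Placeholder entries; replace with your own week of notes.
sample_log = [
    ("boilerplate", True),
    ("boilerplate", True),
    ("small_edit", True),
    ("small_edit", False),
    ("cross_file", False),
    ("cross_file", False),
]

for kind, rate in first_try_rate(sample_log).items():
    print(f"{kind}: {rate:.0%} accepted on first try")
```

If the cross-file category sits far below the others after a week, that is your evidence for keeping a bigger model on those tasks.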

For MAI-Thinking-1, there’s nothing to do but get on the Foundry preview list if you’re curious, and wait for independent benchmarks. I wouldn’t re-architect anything around a model I can’t price and can’t freely access. The AIME numbers are impressive enough that it’s worth tracking, not worth betting on.

And if you’re a decision-maker watching the OpenAI-versus-Microsoft drama: the takeaway isn’t “Microsoft is dumping OpenAI.” It’s “Microsoft now has leverage it didn’t have in 2024.” That’s good for you. More credible in-house models from a hyperscaler means more pricing pressure across the board, and you’re the one who benefits when Anthropic, OpenAI, and Microsoft are all fighting to be the cheap-and-good option in the same Copilot dropdown.

What I’m watching next

A few things will tell us whether this is a genuine inflection or a well-staged demo:

Independent SWE-Bench reproductions of MAI-Code-1-Flash’s 51% claim, from someone other than Microsoft. If a 5B model really beats Haiku 4.5 by 16 points, that reshapes the cheap-coding-model tier and people will pile on to verify fast.

MAI-Thinking-1 pricing and general availability — the preview-to-GA gap, and whether the cost story survives contact with a published price sheet.

And the obvious competitive response. Anthropic and Google don’t sit still, and a Gemini or Claude refresh aimed squarely at the cost-efficiency lane would tell you Microsoft hit a nerve.

For now, the move is small: flip your Copilot model to MAI-Code-1-Flash for a few days and see if a 5B model can quietly handle most of what you throw at it. That’s the most honest benchmark there is.
