Grok 4.3 Review: Does It Beat GPT-5.5 and Claude Opus 4.7?

May 13, 2026
11 min read

xAI shipped Grok 4.3 on May 1, 2026 (API first, with the consumer app catching up over the following two days), and the pitch is unusually blunt for a model launch: it’s cheaper, it reasons by default, it watches video now, and it comes bundled with a voice-cloning suite that builds a usable clone of your voice in under two minutes. No “next frontier” speech, no benchmark victory lap. Just a price tag — $1.25 per million input tokens, $2.50 per million output — and a dare to switch.

So this Grok 4.3 review is really two questions stacked on top of each other. Is the model good enough to take seriously next to GPT-5.5 and Claude Opus 4.7? And even if it isn’t the smartest thing on the leaderboard, does the pricing make it the right call for a chunk of real workloads anyway? Those have different answers, and that’s the interesting part.

What actually changed in Grok 4.3

The headline feature, weirdly, is the absence of a feature. Grok 4.3 has no reasoning toggle anymore. Previous Grok versions made you choose between a fast non-thinking mode and a slower deliberate one; 4.3 just reasons, always, and adjusts how hard it works based on the request. In practice this means you stop babysitting a reasoning_effort parameter and stop getting bitten when you forget to turn thinking on for a hard prompt. It’s a small quality-of-life thing that adds up over a few thousand API calls.
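
If you've been plumbing reasoning parameters through your stack, the call itself gets simpler. Here's a minimal sketch, assuming xAI's OpenAI-compatible chat completions endpoint; the model id "grok-4.3" is a guess at the naming, so check the docs:

```python
# Minimal call sketch, assuming xAI's OpenAI-compatible endpoint.
# The model id "grok-4.3" is a guess at the naming; check xAI's docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",
    api_key="YOUR_XAI_API_KEY",
)

resp = client.chat.completions.create(
    model="grok-4.3",  # hypothetical id
    messages=[{"role": "user",
               "content": "Plan a schema migration for a 2TB Postgres table."}],
    # Note what's missing: no reasoning_effort, no thinking toggle.
    # The model scales its own deliberation to the prompt.
)
print(resp.choices[0].message.content)
```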

Then there’s tool calling, which xAI moved server-side and scaled up. Grok 4.3 can fan out to multiple tool calls in parallel within a single turn — xAI describes it as parallel agent scheduling, and the number floating around is up to 16 concurrent sub-tasks per execution. If you’ve ever written an agent loop that does ten sequential web searches and then synthesizes, you know how much of the wall-clock time is just waiting. Running those concurrently on xAI’s side instead of yours is a genuine speedup for research-style agents.
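
xAI's scheduler runs server-side, so you can't watch it work, but the wall-clock arithmetic it exploits is easy to demonstrate. Here's the client-side version of the same idea, with a hypothetical web_search stand-in for any network-bound tool:

```python
# Illustration of the latency win from fanning out tool calls, written
# client-side since xAI's server-side scheduler is a black box.
# web_search() is a hypothetical stand-in for any ~2s network-bound tool.
import asyncio
import time

async def web_search(query: str) -> str:
    await asyncio.sleep(2.0)  # stand-in for a real search round-trip
    return f"results for {query!r}"

QUERIES = [f"topic {i}" for i in range(10)]

async def sequential() -> list[str]:
    return [await web_search(q) for q in QUERIES]  # ~20s wall clock

async def parallel() -> list[str]:
    return await asyncio.gather(*(web_search(q) for q in QUERIES))  # ~2s

for fn in (sequential, parallel):
    start = time.perf_counter()
    asyncio.run(fn())
    print(f"{fn.__name__}: {time.perf_counter() - start:.1f}s")
```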

Video understanding is new. Grok 4.3 takes text, images, and now video as input — you can hand it a clip and ask what’s in it, transcribe and summarize a screen recording, or pull structured data out of a demo video. It’s not a video generation model (that’s the Imagine API, more on that below); it just reads video the way 4.x already read images.
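
Nobody outside xAI has published the exact request schema here, but if it follows the OpenAI-style content-parts convention the rest of the API uses, a video prompt plausibly looks something like this (the video_url part type is an assumption, not a confirmed field):

```python
# Sketch of video input, assuming OpenAI-style content parts carry over.
# The "video_url" part type is a guess at the schema, not confirmed.
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_XAI_API_KEY")

resp = client.chat.completions.create(
    model="grok-4.3",  # hypothetical id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe this screen recording and list each UI step."},
            {"type": "video_url",  # assumed part type; check the real docs
             "video_url": {"url": "https://example.com/demo.mp4"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```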

The smaller upgrades

A few more: native document output, so the model can return a PDF, an Excel file, or a PowerPoint deck directly instead of you post-processing markdown into one. A 1M-token context window, putting it level with GPT-5.5 and roughly in Gemini territory for stuffing whole repos or document sets into a prompt. Output capped at about 30K tokens per response. Knowledge cutoff around November 2024, which is older than you’d like — lean on tool calls and retrieval for anything recent. And it’s quick: Artificial Analysis clocks Grok 4.3 at roughly 87 output tokens per second, ahead of both GPT-5.5 (around 64) and Claude Opus 4.7 (around 60) at their high-effort settings.

The pricing math, and where the cut actually helps

xAI called it a 40% price cut. Depending on which model you’re measuring against and how your traffic splits between input and output, the real reduction lands somewhere between 40% and 80% — the predecessor’s output pricing in particular came way down. What matters is the resulting number, and it’s this: $1.25 in, $2.50 out, per million tokens, with a tiered bump once a request crosses roughly 200K total tokens.

Put that next to the competition as of May 2026:

  • Grok 4.3 — $1.25 / $2.50
  • Claude Opus 4.7 — $5 / $25
  • GPT-5.5 — $5 / $30 (standard tier; batch knocks it to $2.50 / $15, priority pushes it to $12.50 / $75)

On output tokens, where most agent and chat workloads spend their money, Grok 4.3 is 10x cheaper than Claude Opus 4.7 and 12x cheaper than GPT-5.5 at standard rates. That’s not a rounding-error difference. If you’re running a high-volume summarization pipeline, a customer-facing chatbot, or an agent that generates a lot of text, the bill genuinely changes character.
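
The multiples fall straight out of the rate card. A quick back-of-envelope for an output-heavy workload, using the standard-tier prices above (the traffic shape is an illustrative assumption):

```python
# Back-of-envelope monthly bill for an output-heavy workload, using the
# May 2026 per-million-token rates quoted above (standard tiers).
RATES = {  # (input $/M, output $/M)
    "Grok 4.3":        (1.25,  2.50),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5":         (5.00, 30.00),
}

# Assumed workload: 1M requests/month, ~400 input + ~600 output tokens each.
IN_TOK, OUT_TOK = 400e6, 600e6

for model, (rin, rout) in RATES.items():
    bill = IN_TOK / 1e6 * rin + OUT_TOK / 1e6 * rout
    print(f"{model:<16} ${bill:>9,.0f}/month")
# Grok 4.3         $    2,000/month
# Claude Opus 4.7  $   17,000/month
# GPT-5.5          $   20,000/month
```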

Where the cut helps less: anything where the model’s quality is the bottleneck. If GPT-5.5 one-shots a task that Grok 4.3 takes three tries to get right, you’ve spent more tokens and more of your own time, and the per-token discount evaporates. Hard agentic coding is the obvious case — more on that in the benchmarks section. The discount is real, but it’s a discount on the thing Grok does, not a license to use it where it’s the wrong tool.

One more number worth knowing: xAI hands every developer up to $175/month in free API credits through its data-sharing program. That’s the most generous free tier among the major providers right now, and it makes “just try it on your actual workload” basically free for small teams. The catch is the data sharing — read the terms before you point production traffic at it.

Custom Voices and the Voice API

This is the part of the launch that got the most attention, and it deserves it. xAI’s Custom Voices feature clones a voice from about a minute of natural speech and produces a usable model in under two minutes. Once it exists, that voice works across the Text-to-Speech API and the Voice Agent API, and there’s no surcharge for using a custom voice over a preset — the library ships with 80+ presets across 28 languages if you don’t want to clone anything.

The Voice API itself is priced at $4.20 per million characters, and xAI claims that’s 86–92% cheaper than OpenAI’s comparable voice offering. Against ElevenLabs, the dedicated incumbent, it undercuts on raw cost too — though ElevenLabs still has a deeper bench on emotional control, multilingual nuance, and the long tail of production features studios actually use. If you’re building a voice agent and cost is the constraint, Grok’s worth a serious look. If you’re producing an audiobook or a polished narration, the specialists still earn their premium.
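
To make the per-character price concrete, here's what $4.20 per million characters works out to for two common shapes of workload; the volumes are illustrative assumptions, not xAI figures:

```python
# What $4.20 per million characters means for two common workloads.
# The workload sizes below are illustrative assumptions, not xAI figures.
PRICE_PER_M_CHARS = 4.20

# A voice support agent: ~300 characters per spoken reply.
replies_per_month = 100_000
agent_chars = replies_per_month * 300
print(f"voice agent: ${agent_chars / 1e6 * PRICE_PER_M_CHARS:,.2f}/month")
# -> $126.00/month for 100K replies

# A full-length audiobook: roughly 600K characters of narration.
book_chars = 600_000
print(f"audiobook:   ${book_chars / 1e6 * PRICE_PER_M_CHARS:,.2f} per title")
# -> $2.52 per title
```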

Now the part that should make you pause. Voice cloning that fast, that cheap, with that low a friction floor is exactly the capability that makes fraud and impersonation scale. xAI’s answer is a two-stage consent gate: you read a live verification phrase that the speech-to-text engine transcribes and matches in real time (proving you’re present and intending to do this), and then the system compares speaker embeddings from that verification clip against the full recording to confirm they’re the same person. The claim is that this blocks cloning a third party from some pre-existing recording you scraped off YouTube.
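
As described, the gate is simple enough to sketch. Everything below is a schematic stand-in, hypothetical functions and an assumed threshold, not xAI's actual implementation:

```python
# Schematic of the two-stage consent gate as described. All functions and
# the threshold are hypothetical stand-ins, not xAI's implementation.
import numpy as np

SIMILARITY_THRESHOLD = 0.85  # assumed; xAI publishes no such figure

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def speech_to_text(clip: bytes) -> str:
    raise NotImplementedError  # stand-in for a real STT engine

def speaker_embedding(clip: bytes) -> np.ndarray:
    raise NotImplementedError  # stand-in for a speaker-verification model

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def consent_gate(verification_clip: bytes, full_recording: bytes,
                 challenge_phrase: str) -> bool:
    # Stage 1, liveness: the user reads a fresh challenge phrase on mic,
    # and the live transcript must match it, proving presence and intent.
    if normalize(speech_to_text(verification_clip)) != normalize(challenge_phrase):
        return False
    # Stage 2, same-speaker: embeddings from the verification clip must
    # match the recording being cloned, blocking scraped third-party audio.
    return cosine(speaker_embedding(verification_clip),
                  speaker_embedding(full_recording)) >= SIMILARITY_THRESHOLD
```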

It’s a reasonable design. It is also, as of this writing, an unaudited one — xAI hasn’t published false-acceptance rates, hasn’t released red-team results, and a determined attacker recording a target reading a phrase out loud is not a wildly exotic threat model. Treat the consent gate as a speed bump that filters out lazy misuse, not as a guarantee. And if you’re shipping a product on top of this API, your own terms of service and abuse monitoring matter more than xAI’s, because you’re the one users will blame.

The Imagine API, briefly

Bundled into the same release is the Imagine API — xAI’s image-and-video generation stack, covering text-to-image, text-to-video, image-to-video, editing, and audio in one set of endpoints. The number that travels: video generation with audio runs about $4.20 per minute of output, roughly a third of what Google charges for Veo 3.1 Preview.

Reality check, because the pricing always leads these announcements: cheap-per-minute is great until you factor in retries. Generative video is still a slot-machine workflow — you prompt, you don’t quite get it, you re-roll. A model that’s a third the price but takes twice the attempts to land a usable shot isn’t actually cheaper. Veo and Runway have had more iterations of polish on prompt adherence and motion coherence, and that shows up in how many generations you throw away. Grok Imagine is a credible entry and the price is aggressive, but “cheapest per minute” and “cheapest per usable minute” are different metrics, and the second one is the one that hits your card.
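
One way to keep yourself honest is to price the re-rolls in. The per-minute rates below come from the quoted figures (Veo's is back-derived from the "roughly a third" claim); the attempts-per-keeper numbers are pure assumption, so plug in your own hit rate:

```python
# Retry-adjusted cost: "cheapest per minute" vs "cheapest per usable minute".
# Veo's rate is back-derived from the "roughly a third" claim; the
# attempts-per-keeper figures are illustrative assumptions, not measurements.
def cost_per_usable_minute(price_per_min: float, attempts_per_keeper: float) -> float:
    return price_per_min * attempts_per_keeper

grok_imagine = cost_per_usable_minute(4.20, attempts_per_keeper=6)
veo_3_1      = cost_per_usable_minute(12.60, attempts_per_keeper=2)
print(f"Grok Imagine: ${grok_imagine:.2f}/usable minute")  # $25.20
print(f"Veo 3.1:      ${veo_3_1:.2f}/usable minute")       # $25.20
# At these assumed hit rates, the 3x sticker-price gap has fully closed.
```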

For consumer use, image generation is unlimited on the SuperGrok tier with around 100 video renders a day, and the X Premium+ subscription bundles a credit allotment too.

Benchmarks versus the hype

Here’s where the honest version of this review diverges from the launch-day churn. Grok 4.3 is good. It is not the smartest model you can buy.

On the Artificial Analysis Intelligence Index — a blended score across reasoning, math, code, and agentic tasks — Grok 4.3 lands at 53. For context: GPT-5.5 sits at 60, Claude Opus 4.7 and Gemini 3.1 Pro Preview at 57, and the previous Grok generation at 49. So 4.3 is a real step up over its predecessor and it clears Claude Sonnet 4.6, but it’s a clear tier below the frontier flagships. Anyone telling you Grok 4.3 “beats GPT-5.5” is reading a different leaderboard than the one most people use.

It’s strong in specific places. On τ²-Bench Telecom — instruction-following plus agentic customer support — Grok 4.3 hits 98%, which is excellent and directly relevant if you’re building support agents. On GDPval-style agentic tasks it scores around 1500 Elo, a big jump from the prior version and ahead of Gemini 3.1 Pro Preview, though still trailing GPT-5.5 by a meaningful margin (a couple hundred Elo points). Plain instruction-following on IFBench is more middling at 81%.

Coding is the weak spot relative to the top tier. Claude Opus 4.7 still owns SWE-bench Verified at 87.6% and SWE-bench Pro at 64.3%, and for the kind of multi-file, run-the-tests-and-fix-it agentic coding that those benchmarks measure, Opus is worth its higher price. Grok 4.3 is fine for everyday code generation, boilerplate, and quick scripts; it’s not the model I’d hand a gnarly refactor to.

The argument for Grok 4.3 isn’t intelligence, it’s intelligence-per-dollar. Running the full Artificial Analysis benchmark suite costs about $395 on Grok 4.3, roughly 20% less than its predecessor and a fraction of what the frontier models cost to evaluate. If you plot capability against price, Grok 4.3 sits on a good part of the curve — not the top, but a spot a lot of workloads should be perfectly happy with.

Who should switch, and who shouldn’t

Switch (or at least run a real bake-off) if: you’re a cost-sensitive API builder running high-volume text generation, summarization, classification, or chat — the 10–12x output-token discount over GPT-5.5 and Opus 4.7 is the whole argument and it’s a strong one. Or you’re building voice agents and the Voice API’s pricing makes the unit economics work where ElevenLabs or OpenAI didn’t. Or you’re a customer-support shop and that 98% on τ²-Bench Telecom maps to your actual use case. Use the $175/month free credits to test on your traffic before committing.
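
A bake-off doesn't need infrastructure; replaying a few hundred real prompts through OpenAI-compatible endpoints and tallying usage is enough for a first read. A minimal sketch, with hypothetical model ids and the rates quoted above, quality scoring left to whatever rubric you already use:

```python
# Minimal bake-off harness: replay real prompts against two candidates and
# compare measured cost. Model ids are hypothetical; score quality however
# you already score it.
from openai import OpenAI

CANDIDATES = {
    "grok-4.3": OpenAI(base_url="https://api.x.ai/v1", api_key="XAI_KEY"),
    "gpt-5.5":  OpenAI(api_key="OPENAI_KEY"),
}
RATES = {"grok-4.3": (1.25, 2.50), "gpt-5.5": (5.00, 30.00)}  # $/M in, out

def bake_off(prompts: list[str]) -> None:
    for model, client in CANDIDATES.items():
        cost = 0.0
        for prompt in prompts:
            r = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            rin, rout = RATES[model]
            cost += r.usage.prompt_tokens / 1e6 * rin
            cost += r.usage.completion_tokens / 1e6 * rout
            # Judge r.choices[0].message.content against your rubric here.
        print(f"{model}: ${cost:.4f} across {len(prompts)} prompts")
```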

Don’t bother if: your bottleneck is the hardest 10% of tasks — agentic coding, multi-step reasoning, anything where a smarter model finishes in one pass. Pay for Opus 4.7 or GPT-5.5 there; the per-token savings won’t cover the extra retries and engineer-hours. Same if you’re an enterprise buyer who needs the data-handling guarantees, SOC 2 paperwork, regional controls, and SLA track record that the bigger providers have spent years building — Grok’s enterprise story is younger, and “we ship fast” cuts both ways. And if you’re an X Premium+ or SuperGrok subscriber, just know that full Grok 4.3 access on the consumer side rolled out in stages, with the top tier (the $300/month Heavy plan) getting everything first; check what your plan actually includes before assuming you have it.

The thing Grok 4.3 gets right is knowing what it is. It’s not trying to win the benchmark — it’s trying to be the model you reach for when “good enough, fast, and a quarter the price” describes your problem better than “the absolute best, cost no object” does. That’s a real lane, and a lot of production traffic lives in it.

If you’ve got an API workload where output tokens dominate the bill, spend an afternoon running it through Grok 4.3 on the free credits and compare the numbers — not the leaderboard numbers, your numbers. That’s the only benchmark that decides this one.


Pricing, benchmark scores, and feature availability are as of May 2026 and move fast in this category — check xAI’s API docs and the live Artificial Analysis leaderboard for current figures before you commit a workload.