AI Inference Providers 2026: Honest Comparison Guide

May 6, 2026
12 min read

The inference layer is where your AI bill actually gets paid. You can spend a week obsessing over which gateway to put in front of your LLMs, which observability platform to wire up, which durable orchestrator to run your agents on — and then the per-token meter runs anyway, against whichever provider you picked underneath all of it.

I’ve been routing real workloads through most of the providers below for the past year. Some of them are genuinely great at one thing and mediocre at everything else. A few are quietly the right default for almost any team. And a couple are still selling 2024-era pricing in 2026 like nobody noticed. Here’s how I’d actually pick today.

The 2026 split: custom silicon vs GPU platforms

The inference market split into two camps and the gap got wider this year, not narrower.

On one side: custom silicon. Groq’s LPU, Cerebras’ wafer-scale WSE, SambaNova’s RDU. These chips were designed from the ground up for transformer inference, and they hit numbers GPUs can’t touch on the workloads they support. Groq’s LPU is doing roughly 750 tokens/sec on Llama 4 70B decode with time-to-first-token under a second. Cerebras runs gpt-oss-120B at around 3,000 tokens/sec — fast enough that it changes what you can build. SambaNova hosts Llama 3.1 405B at $5 per million input tokens and serves it materially faster than the GPU platforms hosting that model class.

On the other side: GPU platforms. Together AI, Fireworks, Baseten, Modal, RunPod, DeepInfra, novita.ai, Hyperbolic, OctoAI, Anyscale, Replicate. They run on H100s, H200s, B200s, and increasingly MI300X. They’re slower per-token than Groq or Cerebras, but they support every model anyone publishes, they let you fine-tune, they expose batch APIs, and they don’t lock you into a chip vendor’s roadmap.

The split matters because once you understand it, most of the “which is better” arguments evaporate. Custom silicon is incredible when your workload fits the catalog. GPU platforms win the moment you need flexibility — and most production AI products eventually do.

Latency vs throughput: when speed actually matters

Speed sells. It also gets oversold. Before you pay a premium for tokens-per-second, ask what your user is doing while the tokens stream.

For a chat interface where humans read at ~5 words/sec, anything past about 50 tokens/sec is invisible. The user can’t tell. What they can tell is the first 800 milliseconds of dead air before the first token shows up. That’s where Groq still feels magical — TTFT around 0.6–0.9 seconds even on big open-weight models, where most GPU providers are landing at 1.5–2.5 seconds with cold caches.

For voice agents, the math flips. End-to-end latency under 600ms is the threshold where conversation feels natural, and you only get there if your inference is sub-200ms TTFT. Groq is the obvious pick. Cerebras is faster on raw decode but its first-token isn’t always faster than Groq’s, and you should benchmark on your actual prompts before committing.
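
If you want to run that benchmark quickly, here's a minimal sketch using the OpenAI-compatible streaming interface most of these providers expose. The base URL, API key, and model name are placeholders; swap in the provider and the prompts your app actually sends.

```python
import time
from openai import OpenAI  # pip install openai

# Placeholder endpoint and key: point this at the provider you're evaluating.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

def measure_ttft(prompt: str, model: str) -> tuple[float, float]:
    """Return (time-to-first-token, total time) in seconds for one streamed call."""
    start = time.perf_counter()
    first_token_at = None
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.perf_counter()
    return (first_token_at - start, time.perf_counter() - start)

ttft, total = measure_ttft("Summarize our refund policy in two sentences.", "llama-4-70b")
print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s")
```

Run it a few dozen times per provider on your real prompts, at the hours you actually serve traffic; single-shot numbers hide cold-cache variance.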

For agent loops with long tool chains — fifteen LLM calls to plan, execute, reflect, replan — what you want is throughput per dollar, not blazing speed on any single call. Together’s batch tier and Fireworks’ on-demand pricing both win here, and Cerebras has been quietly competitive on Llama 4 if your agent runs an open-weight stack.

For batch jobs (RAG indexing, dataset labeling, offline classification), nobody beats Together’s batch API at scale. Its 50% discount for async workloads is the lowest sustained price-per-token you can get on hosted Llama 4 outside of fully-reserved capacity.

Pricing: the 70B-tier converged, the 400B-tier didn’t

Hosted Llama 4 70B-class inference has converged hard. Most providers are sitting between $0.40 and $0.90 per million input tokens, with output 2–3x that. Hyperbolic at $0.40 is the floor. Together’s batch tier sits in the same range. Fireworks lands around $0.90 but throws in structured-output guarantees and JSON-mode reliability that the cheaper tiers don’t match. DeepInfra and novita.ai compete on the commodity end and will undercut anyone if you don’t need DX features.

The 400B+ tier is where the spread is wider. Together hosts Llama 4 405B at around $3.50/M input. SambaNova’s $5/M for the same model class is more expensive but materially faster. The closed-weight frontier (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Ultra) lives in a different price universe — anywhere from $3 to $15 per million input tokens depending on model and tier — and you’re going to your hyperscaler or model vendor for those, not to an inference provider.

Image generation has its own commodity tier. Imagen 4 Fast on Vertex hits $0.02/image, which is roughly the new floor for production image gen at scale. Replicate, Fal, and Together all have competitive batch tiers. If you’re generating thousands of images a day, the per-image cost differences add up faster than you’d expect — model your monthly volume before you commit.
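
The volume math is worth doing on paper before you pick. A back-of-envelope sketch, with an assumed daily volume and illustrative per-image prices (only the $0.02 figure above comes from a quoted rate; the rest are placeholders):

```python
# Back-of-envelope monthly image-generation cost.
images_per_day = 5_000  # assumption: plug in your own volume

# Illustrative prices; only the $0.02 tier is quoted above, the others are placeholders.
price_per_image = {"imagen-4-fast": 0.02, "provider_b": 0.04, "provider_c": 0.06}

for name, price in price_per_image.items():
    print(f"{name}: ${images_per_day * price * 30:,.0f}/month")

# At 5,000 images/day, a 4-cent gap per image is roughly $6,000/month.
```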

A note on caching: prompt caching pricing varies wildly. Anthropic, OpenAI, Together, and Fireworks all offer some form of cache hit discount, but the structures differ. If your workload has a stable system prompt above 2K tokens (common for agents and RAG), the caching discount can dwarf the base per-token cost. Read the fine print before you compare list prices.
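
As one concrete shape of this, here's roughly what marking a stable system prompt as cacheable looks like with Anthropic's cache_control field. Other providers cache automatically or expose different knobs, and the model id below is a placeholder, so treat this as a sketch rather than a universal recipe.

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()

# A long, stable system prompt (agents and RAG apps usually have one well above 2K tokens).
SYSTEM_PROMPT = "You are the support agent for Acme. Follow the policies below. ..."

response = client.messages.create(
    model="claude-opus-4-7",  # placeholder id for the frontier model mentioned above
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable; repeat calls reusing this exact
            # prefix are billed at the cache-hit rate instead of the full input price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "A customer wants a refund after 45 days. What do we do?"}],
)
print(response.content[0].text)
```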

Free tiers that actually cover prototypes

The free tiers shifted from “demo only” to “ship a real prototype” sometime in 2025, and most teams haven’t caught up.

  • Cerebras Cloud gives you 1–2 million tokens/day on Llama and gpt-oss models. Enough to run a small internal tool indefinitely.
  • Groq has a daily quota that’s more than generous for most prototype apps. Rate limits are real but the cap resets daily.
  • Google AI Studio still has near-unlimited Gemini access for development purposes, though they’ve started tightening on multimodal calls. The free tier is the cheapest way to test an agent that needs vision.
  • Together gives credits on signup but you’ll burn them quickly on real traffic.
  • OpenRouter routes free-tier model variants from a few providers if you want to test without picking a vendor.

The catch with free tiers: the API surface sometimes diverges from paid. Cerebras’ free tier doesn’t expose the same batch endpoints. Groq’s quota is per-account, not per-key. Don’t build a prototype on a free-tier-only feature and then discover the paid tier behaves differently.

Fine-tuning: where the GPU platforms quietly win

Custom-silicon providers don’t really do fine-tuning. Groq runs what Groq deploys; Cerebras has limited training services for enterprise contracts only. If you need to fine-tune a model and serve it cheaply, you’re on a GPU platform.

Together and Fireworks both nailed this in 2025. You upload a dataset, kick off a LoRA or full fine-tune through their API, and the resulting model is served on the same OpenAI-compatible endpoint your app already calls. Fireworks’ fine-tune flow is faster end-to-end; Together is cheaper and has a wider catalog of base models you can fine-tune from. Both hit production-grade quality.
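
The flow on both platforms looks roughly like the sketch below. The endpoint paths, field names, and model id here are illustrative rather than copied from either provider's docs; the point is the shape of it: upload data, start a LoRA job, then call the resulting model on the same chat endpoint you already use.

```python
import requests

# Hypothetical endpoint paths and field names: the real parameters differ per provider.
API = "https://api.example-provider.com/v1"
HEADERS = {"Authorization": "Bearer YOUR_KEY"}

# 1. Upload a JSONL dataset of prompt/completion pairs.
with open("support_tickets.jsonl", "rb") as f:
    file_id = requests.post(f"{API}/files", headers=HEADERS, files={"file": f}).json()["id"]

# 2. Kick off a LoRA fine-tune against an open-weight base model.
job = requests.post(
    f"{API}/fine-tunes",
    headers=HEADERS,
    json={"model": "llama-4-70b", "training_file": file_id, "method": "lora"},
).json()

# 3. When the job finishes, the resulting model id is served on the same
#    OpenAI-compatible chat endpoint your app already calls.
print(job["id"])
```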

The honest decision tree: prompt-tune first, RAG second, fine-tune last. Most teams jump to fine-tuning when their actual problem is a bad retrieval pipeline. If your model is hallucinating because it doesn’t know your domain, fix retrieval. If it’s getting the facts right but the format wrong, prompt-tune. If it’s getting the format right but there’s a tone or behavior you can’t reach with prompting, that’s where fine-tuning earns its keep.

Open-weight model coverage

Llama 4, Qwen 3, DeepSeek V4, GLM-4.7, Mistral Large 3, gpt-oss — most providers host most of them, but not every variant on every provider, and not at every quantization.

  • Together hosts the widest catalog. If a model dropped this quarter, Together has it within a week.
  • Fireworks focuses on the most-used models with deeper DX investment per model. Less catalog breadth, more polish per endpoint.
  • DeepInfra and novita.ai are aggressive on commodity models with competitive pricing and decent uptime.
  • Hyperbolic specializes in Llama variants at the cheapest hosted price.
  • Cerebras and Groq host curated catalogs — fewer models, but the ones they have run faster than anywhere else.
  • SambaNova serves the fastest hosted 405B-class inference, at a price that’s competitive with the GPU platforms without being the absolute floor.

The Qwen + Fireworks partnership announced earlier this year means Fireworks now gets early access to closed-weight Qwen variants. That’s been quietly underrated — it’s the first time a Chinese model lab has given a Western inference provider partnership-tier access.

Self-host with Baseten, Modal, RunPod

A different category, and worth understanding clearly. Baseten, Modal, and RunPod aren’t selling you tokens against their hosted catalog — they’re selling you GPU capacity to run your own models.

Baseten is the most opinionated. You package your model, define inputs and outputs, and they handle autoscaling, cold starts, and deployment. Best DX in the category if you’re shipping a custom or fine-tuned model that doesn’t fit a managed catalog.

Modal is Python-native serverless done right. You write a Python function, decorate it, and Modal runs it on a GPU with autoscale and per-second billing. The mental model is closer to Lambda than to Kubernetes. Excellent for jobs, batch work, and bursty workloads. Less ideal for steady-state production where you’d want reserved capacity.
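
A minimal sketch of that mental model, assuming Modal's current App and function decorators (exact GPU names and image options may differ from what's shown):

```python
import modal

app = modal.App("llm-batch-demo")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="H100", image=image, timeout=600)
def classify(texts: list[str]) -> list[str]:
    # Model loads inside the container; Modal schedules it onto a GPU on demand.
    from transformers import pipeline
    clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
    return [clf(t, candidate_labels=["refund", "bug", "other"])["labels"][0] for t in texts]

@app.local_entrypoint()
def main():
    # `modal run this_file.py` spins up the GPU, runs the call, and bills per second.
    print(classify.remote(["Where is my refund?", "The app crashes on login."]))
```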

RunPod is the rawest of the three — closer to GPU rental than to a serving platform. Cheapest per GPU-hour by a wide margin, but you’re managing more of the stack yourself. Reasonable choice if you have ML infra people and a workload that benefits from spot pricing.

The trap most teams fall into: they pick self-host because it sounds cheaper, then spend three months on cold-start optimization, autoscaling tuning, and observability before realizing managed inference would have been a third the engineering cost for the same dollar spend. Self-host wins when you’re either (a) running a custom-trained model nobody hosts, or (b) at a scale where reserved capacity beats per-token economics. Below that threshold, just use Together or Fireworks.

DX: OpenAI-compatible endpoints, structured output, batch

Almost everyone exposes an OpenAI-compatible API now. That’s the floor. The differences are in what happens past the floor.

Structured output reliability — if your app depends on JSON mode or function calling actually working on every call, Fireworks is currently the most reliable on open-weight models. Together is solid. Some of the commodity providers will silently return malformed JSON often enough to make your downstream code defensive in ways it shouldn’t have to be.
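
In practice that means writing the defensive code anyway. A small sketch with a placeholder base URL and model: request JSON mode, but validate the result and retry rather than trusting it.

```python
import json
from openai import OpenAI

# Placeholder endpoint and model; any OpenAI-compatible provider slots in here.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

def call_json(prompt: str, retries: int = 2) -> dict:
    """Request JSON mode, but parse defensively: not every provider enforces it equally."""
    last_err = None
    for _ in range(retries + 1):
        resp = client.chat.completions.create(
            model="llama-4-70b",  # placeholder
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        try:
            return json.loads(resp.choices[0].message.content)
        except json.JSONDecodeError as err:
            last_err = err  # malformed JSON: retry instead of crashing downstream
    raise RuntimeError("provider returned malformed JSON on every attempt") from last_err
```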

Streaming — table stakes, but quality varies. Watch out for providers that buffer chunks server-side instead of streaming token-by-token; it kills the perceived speed win.

Batch APIs — Together’s is the gold standard. OpenAI and Anthropic have their own. If you can wait 2–24 hours for a result, the discount is roughly 50% across the board.
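
The submission pattern is similar across providers. Here's the general shape using the OpenAI-style batch flow (a JSONL file of requests, then a batch job with a 24-hour window); Together's and Anthropic's parameters differ, so treat this as a sketch rather than a drop-in.

```python
import json
from openai import OpenAI

client = OpenAI()  # swap base_url/key for whichever provider's batch API you're using

# One JSONL line per request: give it an id, point it at the chat endpoint, include the body.
with open("requests.jsonl", "w") as f:
    for i, doc in enumerate(["doc one text", "doc two text"]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "llama-4-70b",  # placeholder model id
                     "messages": [{"role": "user", "content": f"Classify: {doc}"}]},
        }) + "\n")

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the async window is what buys the ~50% discount
)
print(batch.id)  # poll this id later and download the output file when it completes
```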

Observability hooks — most providers integrate cleanly with LangSmith, Helicone, Langfuse, and Phoenix. If yours doesn’t, that’s a signal.

For provider-agnostic routing, the LLM gateway layer (Portkey, OpenRouter, LiteLLM, Cloudflare AI Gateway) sits in front of all of this and lets you swap providers without code changes. I’ve covered this layer separately and it’s worth wiring up early — even if you only use one provider today, the gateway is what lets you move when pricing shifts or a provider has a bad week.

Reliability and regional residency

99.9% SLAs are claimed broadly and delivered unevenly. The custom-silicon providers have had more publicized outage windows than the major GPU platforms, partly because their fleets are smaller and a single capacity issue affects more customers. That’s improved this year but it hasn’t disappeared.

For EU-only inference, Together, Fireworks, and the hyperscalers all have EU regions. Most of the smaller providers don’t. If you have data residency requirements, this thins the field fast and pushes you toward Bedrock, Vertex, or Azure for the regulated workload.

For multi-region failover, the gateway pattern is again the answer. No single inference provider gives you global multi-region with automatic failover the way a CDN does. You build it in front.
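
The failover itself can be as boring as a try/except across two OpenAI-compatible clients, whether you hand-roll it or let a gateway do it for you. A minimal sketch with placeholder endpoints:

```python
from openai import OpenAI, APIError, APITimeoutError

# Placeholder endpoints: primary provider first, a second region or provider as fallback.
PROVIDERS = [
    OpenAI(base_url="https://eu.primary-provider.example/v1", api_key="KEY_A"),
    OpenAI(base_url="https://eu.fallback-provider.example/v1", api_key="KEY_B"),
]

def complete(messages: list[dict], model: str = "llama-4-70b") -> str:
    last_err = None
    for client in PROVIDERS:
        try:
            resp = client.chat.completions.create(model=model, messages=messages, timeout=10)
            return resp.choices[0].message.content
        except (APIError, APITimeoutError) as err:
            last_err = err  # record the failure and fall through to the next provider
    raise RuntimeError("all providers failed") from last_err
```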

Picks by use case

Real-time voice agent: Groq for the LLM call. The TTFT advantage is the difference between conversational and awkward.

Interactive chat product: Together or Fireworks on Llama 4 70B as the default. Add prompt caching. Route the long-tail expensive queries to Claude Opus 4.7 or GPT-5.5 through the gateway.

Agent orchestrator with long tool chains: Fireworks for structured-output reliability. Cache the system prompt. Run on Llama 4 70B for steps that don’t need frontier reasoning, and only escalate to Opus or GPT-5.5 when the planning step actually needs it.

Batch RAG indexing: Together’s batch API. Llama 4 70B for the embedding-adjacent steps, an embedding model from Cohere or Voyage for the actual vectors.

Fine-tuned vertical model: Together if you want catalog breadth and lower training cost; Fireworks if you want the best end-to-end DX.

Regulated EU workload: Bedrock or Vertex EU. The smaller providers don’t have the compliance posture, no matter what the sales rep says.

Broke startup with $0 budget: Cerebras free tier for chat-class workloads. Groq free tier for low-latency demos. Google AI Studio for anything multimodal. You can ship a real prototype on free tiers alone — just don’t build production on one.

What I’d actually do this week

If you’re starting fresh and don’t have strong existing constraints, set up Together and Fireworks accounts, wire both behind a Portkey or OpenRouter gateway, and route 80% of your traffic to whichever wins on your top three queries when you benchmark. Put Groq behind the same gateway for any latency-sensitive path. Add Cerebras when you need bursts of throughput on supported models. Skip the rest until you have a specific workload that justifies them.

The unsexy truth is that for most production AI apps in 2026, your inference layer should be boring. Two providers, one gateway, prompt caching turned on, observability wired in, batch jobs going to the cheapest tier. Save the exotic stuff for the workloads that actually need it.