Overview

If your agent stack still talks directly to provider SDKs in 2026, you’re going to regret it within a quarter. Not because raw anthropic or openai clients stopped working — they’re as good as ever. The problem is everything around the call: fallbacks when Anthropic throttles you mid-incident, cost attribution across forty agents, PII redaction that keeps your legal team from filing a ticket, audit trails the EU AI Act demands, and the small matter of swapping models without a deploy.

That’s what an LLM gateway does. It sits between your application and however many providers you’ve got, and it turns “which model are we calling” into a routing decision instead of a code change. By spring 2026, the category has genuinely consolidated: five products handle the lion’s share of production traffic, and the rest are either niche or fading.

I’ve spent the past few months running real workloads through Portkey, OpenRouter, LiteLLM, Cloudflare AI Gateway, and Kong AI Gateway. Here’s what I’d actually pick, and why.

Why every serious agent stack has a gateway now

A year ago you could get away with hitting OpenAI directly. Today you’ve got Claude Opus 4.7 for hard reasoning, Sonnet 4.6 for cheap throughput, GPT-5.5 for some workloads, Gemini 3.1 Pro for long-context jobs, and probably DeepSeek or Qwen running somewhere for batch. That’s five providers minimum, often more if you’ve embraced open weights through Together or Fireworks.

Calling them all directly means five sets of SDKs, five auth patterns, five error semantics, five rate-limit headers, and zero shared observability. When Anthropic has a regional outage — which they will — your fallback to GPT-5 happens in application code that someone wrote at 2 a.m. and nobody has tested.

A gateway centralizes all of that. One endpoint, one schema (almost always OpenAI-compatible), one place where retries, fallbacks, caching, guardrails, cost tracking, and audit live. The agent code gets simpler. The infra gets honest about what it’s doing.

The four jobs a gateway should actually handle:

  • Routing and fallback — pick the model, fail over when it’s down, weight load across providers
  • Cost and observability — token counts, $/request, $/agent, $/customer, with traces you can replay
  • Governance — PII redaction, jailbreak detection, content filters, audit logs that hold up in a compliance review
  • Caching — exact and semantic, because “rewrite this email more politely” gets asked a thousand times

Every gateway claims all four. None of them do all four equally well.

OpenRouter — the simplest possible thing

OpenRouter is the easiest sell in the category. One API key, one OpenAI-compatible endpoint, 200-plus hosted models you can swap by changing a string. You point your existing OpenAI client at https://openrouter.ai/api/v1, change the model name to anthropic/claude-opus-4-7 or google/gemini-3.1-pro, and you’re done. Onboarding takes under ten minutes.
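
Here’s what that swap looks like with the standard openai Python client. The endpoint and the model string are the ones OpenRouter uses; the env var name is just a placeholder of mine.

```python
# Pointing the stock openai client at OpenRouter: swap the base URL and the key,
# and the model becomes a string you can change per request.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # key from your OpenRouter dashboard
)

resp = client.chat.completions.create(
    model="anthropic/claude-opus-4-7",  # or google/gemini-3.1-pro, just change the string
    messages=[{"role": "user", "content": "Summarize this incident report in three bullets."}],
)
print(resp.choices[0].message.content)
```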

The catch is the pricing model. OpenRouter takes a margin on top of provider rates — around 5.5% as of April 2026, though it varies by model and credit structure (check their pricing page before you commit, because they’ve shifted it twice in the past year). At $500/month of inference spend, that’s nothing. At $50,000/month, you’re handing them $2,750 every cycle for what is essentially a reverse proxy.

It’s also relatively thin on the governance side. Cost tracking exists, basic analytics exist, but PII redaction, prompt versioning, and serious guardrails are not the product. OpenRouter is a router, full stop. If you need policy enforcement, you either bolt it on yourself or look elsewhere.

I’d reach for OpenRouter when I’m prototyping, when I’m a solo dev, or when I’m running an agent that genuinely benefits from cheap experimentation across many models. I would not run a six-figure-spend production stack on it without a real conversation about whether the margin justifies the convenience.

Portkey — the production gateway that just got a lot more interesting

Portkey was already the strongest production gateway in the managed category by late 2025 — guardrails, prompt versioning, PII redaction, jailbreak detection, fallback graphs, semantic caching, the works. The thing that changed the conversation is that in March 2026 they open-sourced the gateway core under Apache 2.0.

That matters more than it sounds. You can now self-host the actual proxy in your VPC, keep all your traffic inside your perimeter, and pay Portkey only for the managed control plane (analytics, prompt library, evals, the dashboard). For regulated industries — finance, healthcare, anyone in the path of HIPAA or the EU AI Act — that’s the difference between “Portkey is on the procurement deny-list” and “yes, we’re shipping with this.”

The feature set is wide. Guardrails are configurable per-route. Prompt versioning ships with A/B routing and rollback. The fallback configuration is actually expressive — you can do conditional fallbacks based on error type, weighted load balancing, semantic similarity caching, the lot. And the OpenAI-compatible API means migrating from raw SDKs is a sed command.
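
To give a feel for what “expressive” means, here is roughly the shape of a fallback-plus-cache config. Treat it as a sketch: the field names are recalled from Portkey’s config schema rather than copied out of their docs, so verify against the current schema before you ship anything.

```python
# Illustrative routing config in the general shape Portkey's gateway configs take:
# an ordered fallback strategy, per-target retries, and semantic caching.
# Field names are approximate; check Portkey's current schema before relying on them.
routing_config = {
    "strategy": {"mode": "fallback"},          # try targets in order, move on when one errors
    "targets": [
        {"provider": "anthropic", "override_params": {"model": "claude-opus-4-7"}},
        {"provider": "openai", "override_params": {"model": "gpt-5.5"}},
    ],
    "retry": {"attempts": 2},                  # per-target retries before falling through
    "cache": {"mode": "semantic"},             # serve near-duplicate prompts from cache
}
```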

The downsides: the managed control plane has real per-seat and per-request pricing, the documentation is dense, and the configuration surface is large enough that you’ll want someone on the team who actually owns it. This is not a “set it and forget it” gateway. It’s a platform.

Portkey is my default pick for any team running production agents at scale who cares about governance. Especially now that the gateway is Apache 2.0.

LiteLLM — the cost winner, with conditions

LiteLLM is the open-source Python proxy that sits in front of 100-plus providers and exposes them all as an OpenAI-compatible API. You self-host it. You run it. You pay zero margin to anyone — your inference cost is the provider’s actual rate, full stop.

For a team spending $10,000+/month on inference, the math gets very loud. At $50K/month, the margin you’d pay OpenRouter or the platform fee you’d pay a managed Portkey tier roughly covers a whole DevOps engineer’s time keeping LiteLLM healthy. At $200K/month, it’s not even close — LiteLLM saves you real money that goes back into the model bill.

What you get with that savings: routing, fallback logic, virtual keys for cost allocation per team or customer, budget enforcement, exact and semantic caching, and decent integrations with Langfuse, Datadog, and most observability stacks. What you don’t get: the polished UI, the prompt versioning UX, the curated guardrail library, or any of the managed-service comfort.
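
A sketch of the routing idea using LiteLLM’s in-process Router; the self-hosted proxy takes a YAML config of essentially the same shape (a model_list plus fallbacks). The aliases, model strings, and env vars here are placeholders.

```python
# LiteLLM routing sketch: a model_list of provider-backed deployments, each behind an
# alias, with a fallback chain. The self-hosted proxy expresses the same thing in YAML.
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "primary",                       # alias your app code calls
            "litellm_params": {
                "model": "anthropic/claude-opus-4-7",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "backup",
            "litellm_params": {
                "model": "openai/gpt-5.5",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[{"primary": ["backup"]}],                   # if primary errors, retry on backup
)

resp = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "Classify this support ticket: login page 500s."}],
)
print(resp.choices[0].message.content)
```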

The honest tradeoff: LiteLLM is a great tool that demands DevOps capacity. If your team can run a Python service in production — Postgres for state, Redis for cache, Helm chart in your cluster — you’ll be fine, and you’ll be fine cheaply. If your team currently outsources infra to a PaaS, the operational cost will eat the savings. Know which team you are before you commit.

Cloudflare AI Gateway — the obvious pick if you’re already there

Cloudflare AI Gateway exited beta earlier in 2026 and is now the default LLM gateway for anyone whose stack already runs on Cloudflare. Free tier exists, edge-cached responses are genuinely fast, and the integration with Workers, Hyperdrive, Vectorize, and R2 is the tightest you’ll find anywhere.

The pitch is straightforward: if your inference traffic already passes through Cloudflare’s edge, putting the gateway there means zero added latency, automatic logging, response caching for free, and a unified bill. Their analytics dashboard is genuinely good — request volume, error rates, cached vs uncached, broken down by provider — and the Workers integration means you can add custom routing logic in JavaScript without standing up a separate service.
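
Wiring it up is the same base-URL swap as everywhere else. A sketch follows, with the path shape written from memory of Cloudflare’s docs; your dashboard shows the exact endpoint for your account and gateway, so trust that over this.

```python
# Routing existing OpenAI-SDK traffic through a Cloudflare AI Gateway by swapping the
# base URL. ACCOUNT_ID and the gateway name are placeholders; the path shape below is
# an assumption -- confirm it against the endpoint your Cloudflare dashboard displays.
import os
from openai import OpenAI

ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]
GATEWAY_ID = "prod-llm-gateway"  # hypothetical gateway name

client = OpenAI(
    base_url=f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/openai",
    api_key=os.environ["OPENAI_API_KEY"],  # still your provider key; Cloudflare proxies, caches, logs
)

resp = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Draft a status-page update for elevated latency."}],
)
```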

Where it lags: governance and guardrails are thinner than Portkey’s, prompt management is basically not a feature, and the model-selection UX is more “configure your provider keys” than “browse a catalog.” It’s a competent infrastructure layer, not a product platform.

For a Cloudflare shop, this is a non-decision — you should be using it, and the only question is whether to layer Portkey or LiteLLM on top of it for governance. For everyone else, it’s not worth migrating your stack to Cloudflare just to get the gateway.

Kong AI Gateway — the enterprise answer

Kong AI Gateway is what happens when an established API-management vendor extends its existing product into the AI lane. You get semantic caching, RBAC, governance plugins, rate limiting that actually works under load, and the ability to manage AI traffic alongside the rest of your service mesh in the Kong control plane you’ve already paid for.

If your organization already runs Kong — which a startling number of large enterprises do, often without any individual team realizing it — adding the AI Gateway plugin set is the path of least resistance. You don’t introduce a new vendor, you don’t fight procurement, and the security and audit story is identical to the one you’ve already approved for the rest of your traffic.

If you’re not already on Kong, this is not the entry point. The product is excellent inside its world and deeply unmotivating outside it. The license cost alone makes it a non-starter for teams that aren’t already in the Kong universe, and the developer experience is correspondingly enterprise-flavored — capable, configurable, not delightful.

Honorable mentions

A few others come up in serious conversations and deserve a sentence each.

Helicone is observability-first: a passthrough proxy that does great cost tracking and request inspection, but it is not really trying to be a routing or governance layer. Pair it with raw SDKs or another gateway; don’t use it as your only layer.

OpenPipe is fine-tune-aware routing — they want to route your easy requests to a smaller fine-tuned model and your hard requests to GPT-5.5 or Claude. Genuinely interesting if you’re at the scale where fine-tuning pays for itself.

TrueFoundry, Eden AI, and llmgateway.io are all credible smaller plays. None of them have hit the install base or feature parity to displace the five above, but they’re worth a look if your specific requirements (e.g., on-prem-only, particular compliance posture) rule out the leaders.

AWS Bedrock is the “native” option if your stack lives entirely inside AWS. It’s a perfectly reasonable choice for an AWS-only shop with first-party model access. It’s not a gateway in the same sense as the others — it’s a model marketplace with some routing — and it doesn’t help you if you also need to call OpenAI or Gemini.

True monthly cost across spend tiers

Sticker prices are misleading. What actually matters is total cost: provider bill plus gateway margin or fees plus DevOps overhead.

At $1K/month of inference, the gateway is rounding error. Use OpenRouter, Cloudflare AI Gateway, or Portkey’s free tier — the difference is twenty bucks, pick the developer experience you like.

At $10K/month, OpenRouter’s margin starts to be real money (roughly $550/mo). Portkey’s managed plan is competitive. LiteLLM saves you most of that margin if you can absorb the ops cost — call it a half-day per week of someone’s time.

At $100K/month, the math is loud. OpenRouter’s margin alone is north of $5K/mo. LiteLLM self-hosted plus a managed Langfuse instance for observability will run you under $1K/mo in actual infra plus an engineer’s part-time attention. Portkey self-hosted (Apache 2.0 core) plus their managed control plane lands somewhere in the middle, with substantially better governance than rolling your own.
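
If you want to sanity-check those tiers against your own bill, the arithmetic fits in a few lines. The 5.5% margin and the roughly $1K of infra come from this section; the hourly rate and the hours are my own assumptions, so swap in your numbers.

```python
# Back-of-envelope monthly total: provider spend + gateway margin + platform/infra fees
# + the DevOps time self-hosting actually consumes. All inputs are assumptions to replace.
def monthly_total(inference_spend, margin_pct=0.0, platform_fee=0.0,
                  devops_hours=0.0, hourly_rate=120.0):
    return (inference_spend
            + inference_spend * margin_pct     # e.g. OpenRouter's ~5.5% cut
            + platform_fee                     # managed tier or self-hosted infra
            + devops_hours * hourly_rate)      # the cost nobody puts on the invoice

spend = 100_000  # $/month of raw inference
openrouter = monthly_total(spend, margin_pct=0.055)
litellm    = monthly_total(spend, platform_fee=1_000, devops_hours=20)

print(f"OpenRouter: ${openrouter:,.0f}/mo   LiteLLM self-hosted: ${litellm:,.0f}/mo")
```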

The DevOps cost is the wildcard. Teams underestimate it consistently — running LiteLLM well in production is more work than the README suggests. Be honest about your team’s capacity before you optimize for sticker price.

A decision tree that actually works

For a solo developer or early-stage prototype: OpenRouter. Stop overthinking it. You can migrate later.

For a startup with a small platform team: Portkey managed. The governance features will save you a quarter’s worth of work the first time you have to answer a security questionnaire.

For a scale-up north of $25K/month inference spend with DevOps capacity: LiteLLM self-hosted, Langfuse for traces, Cloudflare or your own CDN in front of it for edge caching.

For a Cloudflare-native shop: Cloudflare AI Gateway, with Portkey or your own guardrail layer added on top if compliance demands it.

For an enterprise already running Kong: Kong AI Gateway, full stop. Don’t introduce a new vendor for this.

The pattern that keeps emerging across all these stacks, regardless of which gateway you pick, is a three-layer architecture: gateway for routing and policy, evals for quality, observability for traces. Picking the gateway is the first decision. The other two get layered on top, and the gateway you pick will shape which observability and eval tools fit cleanly.

Migrating off raw SDKs without breaking production

If you’re sitting on direct provider SDK calls and reading this thinking “yes but our codebase is full of from anthropic import Anthropic,” the migration is less scary than it looks. Every gateway here exposes an OpenAI-compatible chat-completions endpoint. The mechanical change is two lines: swap the base URL, swap the API key.
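
Concretely, the before and after looks something like this. The gateway URL and key are placeholders for whichever gateway you pick, and the model strings follow the naming used earlier in this piece.

```python
# Before: a direct provider SDK call buried in application code.
import os
from anthropic import Anthropic

direct = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
msg = direct.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Triage this alert."}],
)

# After: the same request through a gateway's OpenAI-compatible endpoint.
# GATEWAY_BASE_URL / GATEWAY_API_KEY are placeholders; routing, retries, fallbacks,
# and caching now live in the gateway config rather than in this code.
from openai import OpenAI

gateway = OpenAI(
    base_url=os.environ["GATEWAY_BASE_URL"],
    api_key=os.environ["GATEWAY_API_KEY"],
)
resp = gateway.chat.completions.create(
    model="claude-opus-4-7",  # or a routing alias defined in the gateway
    messages=[{"role": "user", "content": "Triage this alert."}],
)
```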

The real work is on the routing config side. Decide your fallback chains, your timeout budget, your retry policy, and your caching rules before you cut traffic over. Run the gateway in shadow mode for a week — duplicate traffic, compare outputs, watch the cost dashboard — and only then flip the live path. Two weeks is plenty for a team that hasn’t done this before. One week if you’re disciplined.
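
Shadow mode doesn’t need special tooling. A minimal sketch of the idea follows; the logging callback is a stub you’d replace with whatever diffing and cost tracking you actually use, and both clients are assumed to be OpenAI-compatible.

```python
# Shadow-mode sketch: serve production from the existing direct path, mirror the same
# request through the gateway, and log both responses and latencies for offline comparison.
import time

def shadowed_completion(direct_client, gateway_client, model, messages, log):
    start = time.monotonic()
    primary = direct_client.chat.completions.create(model=model, messages=messages)
    direct_ms = (time.monotonic() - start) * 1000

    try:
        start = time.monotonic()
        shadow = gateway_client.chat.completions.create(model=model, messages=messages)
        gateway_ms = (time.monotonic() - start) * 1000
        log({
            "model": model,
            "direct_ms": round(direct_ms),
            "gateway_ms": round(gateway_ms),
            "direct_text": primary.choices[0].message.content,
            "gateway_text": shadow.choices[0].message.content,
        })
    except Exception as exc:          # the shadow path must never break production
        log({"model": model, "shadow_error": str(exc)})

    return primary                    # production keeps serving the direct response
```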

The thing nobody tells you: the hardest part isn’t the migration, it’s resisting the urge to keep adding gateway features once you have one. Start with routing and fallbacks. Add cost attribution next. Layer in guardrails when legal asks. Don’t try to launch with all of it on day one.

If you’ve been putting this off, pick the smallest production agent you’ve got, point it at OpenRouter or Portkey for an afternoon, and watch a single week of cost data come in. Most teams do that and immediately want the gateway in front of everything else.