If you shipped an agent in 2024, you probably wired up traces in your spare time and shrugged at evals. That stopped being acceptable about a year ago. Silent regressions, $4,000 prompt-loop incidents, agents that quietly stopped calling their tools after a model upgrade — every team I talk to has the same scar tissue and the same question: which observability platform are you on?
By Q2 2026 the answer has narrowed. There are five tools that show up in real production stacks, and the choice between them is now load-bearing infrastructure. Pick wrong and you’re either re-instrumenting in eighteen months or paying a six-figure bill for tracing data you can’t query.
This is the eval-and-observability layer of the three-layer pattern I wrote about yesterday in the LLM gateway piece. The gateway is the front door. Observability is what tells you the front door is on fire.
Why this stopped being optional
Three things converged. Agents went multi-step (a single user request now spans a planner, four tool calls, two sub-agents, and a synthesis call — and you cannot debug that without distributed-trace-style spans). Model providers ship updates weekly (Claude Opus 4.7 dropped in April and a fraction of prompts that worked perfectly on 4.6 silently degraded). And regulators got specific.
The EU AI Act’s high-risk-system requirements kick in on August 2, 2026. If your agent does anything labeled high-risk (recruitment, credit, critical infra, education, biometric ID, justice) you need an audit trail covering inputs, outputs, intermediate steps, and the model versions that produced them. Colorado’s AI Act applies starting February 2026 with similar audit expectations for “consequential decisions.” The audit obligation doesn’t say “use a vendor,” but it does say “produce, on demand, a reproducible record of how the system reached this decision.” That’s what observability platforms sell.
Add the cost angle — agents burn tokens unpredictably, and a $0.03 average request can become a $14 outlier the moment a tool returns a 50K-token error message and the agent decides to retry. Without per-trace cost attribution you find out from the monthly invoice.
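To make per-trace attribution concrete, here is the arithmetic in miniature. The price table, model names, and alert threshold below are illustrative assumptions, not any provider's actual rates:

```python
# Per-trace cost attribution, back-of-envelope. All prices hypothetical.
PRICE_PER_1M = {  # (input, output) in USD per million tokens; assumptions
    "big-model": (15.00, 75.00),
    "small-model": (0.80, 4.00),
}

def span_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call; a trace's cost is the sum over its spans."""
    inp, out = PRICE_PER_1M[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# The failure mode from above: a tool returns a 50K-token error message
# and the agent retries six times with it still in context.
spans = [("big-model", 50_000, 2_000)] * 6
total = sum(span_cost(*s) for s in spans)
if total > 1.00:  # alert threshold is a made-up number; tune per product
    print(f"outlier trace: ${total:.2f}")  # $5.40 here vs a $0.03 average
```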
So: not optional. Now the question is which one.
The five that matter
The honest cut: Langfuse, LangSmith, Braintrust, Arize Phoenix, and Helicone. Everything else is either a tier down (Lunary, Galileo, OpenLLMetry as a standard, W&B Weave for ML-shop overlap) or a generalist observability platform stretching into LLM (Datadog LLM Observability, New Relic AI Monitoring, Honeycomb’s LLM features) that I’d only pick if you’re already deeply on that vendor.
Here’s how I’d bucket them at a glance:
- LangSmith — the LangChain/LangGraph-native default. Deepest tracing for graph-based agents. Premium pricing.
- Langfuse — open-source under MIT, self-hostable, 3.0 shipped February 2026 with proper agent-trace UI. The cost winner above ~$2K/month of tracing.
- Braintrust — eval-first. The platform that prompt engineers at OpenAI and Anthropic actually use. Lighter on production tracing.
- Arize Phoenix + AX — open-source Phoenix for tracing/eval, paid AX for production observability and drift. Best if you also have classical ML running.
- Helicone — proxy-style. Drop-in for any OpenAI-compatible API. Generous free tier. Five-minute install.
LangSmith: the path of least resistance for LangChain shops
If your stack is already LangChain or LangGraph, LangSmith is the obvious pick and you can stop reading. The integration is a pair of environment variables. Traces show graph topology natively, which matters more than people realize when your agent has nested sub-graphs and you’re trying to figure out which node is hanging.
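For code outside the framework, the langsmith SDK's `@traceable` decorator gets you the same structured spans. A minimal sketch; env var names have shifted between SDK versions, so verify against current docs:

```python
# LangSmith tracing setup. For LangChain/LangGraph code, the env vars are
# the whole integration; @traceable covers functions outside the framework.
import os

os.environ["LANGSMITH_TRACING"] = "true"   # older SDKs: LANGCHAIN_TRACING_V2
os.environ["LANGSMITH_API_KEY"] = "lsv2_..."

from langsmith import traceable

@traceable(run_type="tool", name="lookup_customer")
def lookup_customer(customer_id: str) -> dict:
    # Shows up as a structured span in the trace tree, not a log line.
    return {"id": customer_id, "tier": "pro"}
```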
The downsides are pricing and lock-in. LangSmith’s pricing changed in early 2026 to a tiered model with per-seat fees, dataset row charges, and trace retention tiers. For a team of ten with serious traffic you’re looking at low five figures annually before you negotiate, and dataset rows for evals get expensive fast. The lock-in is softer — they support OpenTelemetry export now — but the deepest features (graph visualizations, dataset experiments tied to prompts) are proprietary.
When LangSmith is the right call: you’re committed to LangGraph, your team values “it just works” over cost optimization, and you’d rather pay than run another stateful service.
Langfuse: the OSS pick that grew up
I’d bet most platform engineers I respect are running Langfuse in 2026. The 3.0 release in February rebuilt the agent-trace view, shipped a real prompt-management UI with versioning and A/B routing, and added dataset experiments that actually compete with Braintrust feature-for-feature. The license is MIT. The whole stack runs on Postgres plus optional ClickHouse for traces at scale.
The thing that puts it over the top for me: self-hosting Langfuse genuinely works. I’ve seen teams run it on a single 4-vCPU box for months handling tens of thousands of traces a day. Compare that to self-hosting Arize Phoenix at production scale (doable but more moving parts) or LangSmith (not an option).
The trade-off is operational ownership. You’re running another service. Backups, upgrades, scaling ClickHouse when traces blow past Postgres-comfortable volumes. If your team doesn’t have someone who’d enjoy that, Langfuse Cloud exists at competitive pricing — but at that point Helicone is cheaper for pure observability and Braintrust is better for evals.
The killer feature people sleep on: Langfuse’s prompt management lets you deploy a prompt change without a code deploy. Versioned, with rollback. For teams iterating on agent prompts daily, this alone justifies the migration.
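A sketch of what that looks like with the Langfuse Python SDK; the prompt name, label, and template variables here are hypothetical:

```python
# Prompt management without a code deploy: minimal Langfuse sketch.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST from env

# "production" is a label you re-point between versions in the UI, so
# rolling back a bad prompt is a label move, not a service deploy.
prompt = langfuse.get_prompt("support-agent-system", label="production")
system_text = prompt.compile(company_name="Acme", tone="concise")
print(prompt.version, system_text[:80])
```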
Braintrust: where serious eval work lives
Braintrust is a different animal. It’s not really an observability tool — it’s an eval platform that happens to do tracing. The distinction matters. If your team’s bottleneck is “we ship prompt changes and pray,” Braintrust is the answer. If your bottleneck is “production agents are doing weird things and we can’t see why,” it’s the wrong tool.
The eval primitives are the cleanest in the category. Datasets, scorers (built-in, custom, LLM-as-judge), experiments that diff two prompt versions or two models on the same dataset, and a UI where prompt engineers can tweak and re-run without writing code. Anthropic and OpenAI’s prompt-engineering teams use it for a reason — when your job is making a prompt 4% better, the iteration loop matters more than anything else.
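The shape of a Braintrust experiment, close to their quickstart. The dataset contents and task function are stand-ins; Levenshtein is one of the built-in autoevals scorers, and LLM-as-judge scorers sit alongside it:

```python
# A Braintrust eval in miniature. Requires BRAINTRUST_API_KEY in the env;
# run the file directly or via `braintrust eval`.
from braintrust import Eval
from autoevals import Levenshtein

def task(input: str) -> str:
    # Stand-in for your real agent entry point.
    return "Sure, here is the reset link."

Eval(
    "support-agent",  # project name, hypothetical
    data=lambda: [{"input": "reset my password",
                   "expected": "Sure, here is the reset link."}],
    task=task,
    scores=[Levenshtein],
)
```

Swap the dataset for a few hundred real rows and the scorer for an LLM-as-judge, and you have the diff-two-prompts loop the whole product is built around.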
Production tracing exists but is thinner than Langfuse or LangSmith. You won’t get the same depth of agent-step debugging. The pricing is enterprise-flavored (talk to sales for anything serious), which puts it out of reach for solo devs and small teams.
The pattern I keep seeing: teams pair Braintrust with something else. Braintrust for eval-driven development, Langfuse or Helicone for production tracing. Yes, two tools. Yes, that’s annoying. The current state of the market.
Arize Phoenix: the ML-shop choice
Phoenix is open-source from Arize AI. If you already have classical ML in production and you’re adding LLM workloads, Phoenix unifies the two — the same dashboards show your XGBoost model’s drift and your RAG pipeline’s retrieval quality. Nobody else does this well.
Phoenix-the-OSS project ships a solid notebook-and-local experience and OpenTelemetry-native traces. For production, you’ll typically want Arize AX (the paid platform) for retention, alerting, drift monitoring, and the operational features you’d expect. AX is priced for the enterprise ML buyer, not the indie-hacker LLM developer.
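Wiring an app into Phoenix looks roughly like this, using the phoenix.otel helper and an OpenInference instrumentor. Package and function names are current as far as I know, but verify against the Phoenix docs:

```python
# Phoenix tracing setup: OTel-native, no proprietary trace format.
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Defaults to a local Phoenix collector; point it at Arize AX or a
# self-hosted instance for production.
tracer_provider = register(project_name="rag-pipeline")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# From here, every OpenAI SDK call shows up as a standard OTel span.
```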
The honest take: if you’re an LLM-only team, Phoenix is fine but not the obvious winner against Langfuse for OSS or LangSmith for managed. If you have an ML platform team that already speaks Arize, adding LLM workloads to AX is the path of least resistance and the unified-platform pitch is real.
Helicone: the five-minute install that scales further than you’d think
Helicone is a proxy. You change your OpenAI base URL to point at Helicone, add your API key as a header, and you have logging. That’s it. It works with anything OpenAI-compatible, which in 2026 means basically every model provider including Anthropic via their compat endpoint.
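In Python that is the entire integration. The base URL and header names below follow Helicone's docs as of writing (verify before shipping); the custom-property header is optional:

```python
# Helicone via proxy: swap the base URL, add one auth header.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        # Optional: custom properties become filters in the dashboard.
        "Helicone-Property-Feature": "onboarding-agent",
    },
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```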
The free tier is generous (10K logs/month last I checked — verify current limits before committing), and paid plans are cheap relative to the category. Sessions and traces are supported, along with custom properties for filtering, prompt experiments, and an actually-useful playground.
Where Helicone fits: solo devs, small teams, side projects, and the early stages of any product where you want observability now and don’t want to think about it. Where it stops fitting: deep agent debugging where you need to see tool calls as structured spans rather than logged messages, and serious eval workflows where Braintrust or Langfuse pull ahead.
The proxy model has a downside worth naming: every request goes through Helicone before reaching your model. They publish low p99 latency numbers, but you’ve added a network hop and a vendor in your hot path. If you can’t tolerate that, the SDK-based tools (everyone else on this list) trace asynchronously without affecting request latency.
OpenTelemetry vs proprietary: the question that decides your next migration
Here’s the thing nobody puts on their pricing page. The agent ecosystem is converging on OpenTelemetry GenAI semantic conventions as the trace format: standardized span and attribute names for model calls, tool calls, and agent steps, with OpenLLMetry as the best-known implementation of them. Tools that emit and ingest OTel-compliant traces are interchangeable. Tools that don’t will lock you into their proprietary format, and your next migration will involve re-instrumenting your code.
As of April 2026:
- Helicone — OTel-friendly, can export.
- Langfuse — supports OTel ingestion (added in 3.0), proprietary SDK still preferred for full feature parity.
- Arize Phoenix — OTel-native from day one, the cleanest story here.
- LangSmith — proprietary SDK with OTel export added recently.
- Braintrust — proprietary first, OTel support is partial.
If portability matters to you, weight Phoenix and Langfuse higher. If you’re not planning to leave anyway, this is noise.
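To make “interchangeable” concrete, here is roughly what a convention-compliant span looks like when emitted by hand. An instrumentation library normally does this for you, and the gen_ai.* attribute names are still marked experimental in the spec, so treat the details as a sketch:

```python
# A GenAI-convention OTel span, hand-rolled for illustration.
from opentelemetry import trace

tracer = trace.get_tracer("agent-demo")

# Span name convention is "{operation} {model}".
with tracer.start_as_current_span("chat big-model") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "big-model")
    span.set_attribute("gen_ai.usage.input_tokens", 1204)
    span.set_attribute("gen_ai.usage.output_tokens", 87)
    # Any OTel-compliant backend (Phoenix, Langfuse, a vanilla collector)
    # ingests this without vendor SDK changes. That is the portability.
```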
True cost at three scales
Sticker prices lie because seat counts, dataset rows, retention windows, and overages hide the real bill. Rough order-of-magnitude estimates from teams I’ve talked to (verify current pricing before committing — every vendor on this list has changed pricing in the last six months):
~$1K/month inference spend (small startup, one product)
- Helicone free tier or low-paid plan: ~$0–$50/mo
- Langfuse Cloud Hobby or self-hosted: ~$0–$30/mo
- LangSmith Plus: starts around $39/seat/mo, gets expensive with seats
- Braintrust: usually overkill at this scale
- Phoenix self-hosted: free, your time is the cost
~$10K/month inference spend (Series A, real production traffic)
- Langfuse self-hosted: ~$100–300/mo infrastructure, plus eng time
- Helicone Pro: low hundreds/mo
- LangSmith: low-to-mid four figures/mo with a small team
- Braintrust: similar four figures/mo, eval-heavy use
- Arize AX: enterprise pricing, expect a sales call
~$100K/month inference spend (Series B+, agents in critical path)
- Langfuse self-hosted with dedicated infra: ~$1–3K/mo, plus a half-FTE
- LangSmith Enterprise: mid-to-high five figures annually, negotiable
- Braintrust Enterprise: similar territory
- Arize AX: comparable to LangSmith Enterprise
- Helicone: still surprisingly competitive if you only need observability
The crossover point where self-hosted Langfuse beats LangSmith on raw cost is around $2K/month of LangSmith spend. The crossover where it makes sense operationally is “you have someone who’d run it.” Those aren’t the same number.
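The two crossovers come from different math. A back-of-envelope version, every figure an assumption you should replace with your own quotes:

```python
# Self-host vs managed crossover, back-of-envelope. All numbers assumed.
langsmith_monthly = 2_000   # current managed bill, USD
infra_monthly = 250         # self-hosted box plus ClickHouse storage
eng_hours_monthly = 8       # upgrades, backups, the occasional incident
eng_rate = 120              # loaded cost, USD/hour

self_hosted = infra_monthly + eng_hours_monthly * eng_rate
print(self_hosted, "<", langsmith_monthly)  # 1210 < 2000: math works
# Rerun with langsmith_monthly = 500 and the same ops overhead makes
# self-hosting a loss. That is why the dollar crossover and the
# "do we have someone to run it" question are different numbers.
```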
EU AI Act and the audit-trail question
If you’re touching anything that could be classified high-risk under the EU AI Act (full list in Annex III, but the categories most likely to bite SaaS builders: recruitment, credit scoring, education, essential services), you need an audit trail by August 2, 2026. Concretely:
- Inputs and outputs preserved with timestamps
- Model versions identifiable per request
- Intermediate reasoning steps for agent systems
- Retention periods sufficient for regulatory inquiry (interpreted as multi-year by most counsel I’ve seen)
- PII handling that doesn’t itself create GDPR exposure
What this means in tool-selection terms: long retention costs money, and the cheap tiers of every platform on this list cap retention at 30 days or less. Budget for the retention tier you’ll actually need. Self-hosted Langfuse on your own storage is the cheapest path to long retention if you can run it. LangSmith and Braintrust ship enterprise-grade audit features but you’ll pay enterprise-grade prices.
PII scrubbing matters too. Langfuse, LangSmith, and Helicone all support before-send redaction. Test it against your real prompts before you trust it — regex-based scrubbers miss surprising things in agent contexts.
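For a feel of why testing matters, here is the kind of regex scrubber you would pass to a before-send hook (Langfuse's Python SDK takes a function like this via its mask parameter, last I checked), and exactly the kind that misses things:

```python
# Regex PII scrubber of the sort wired into before-send redaction hooks.
import re

PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]

def scrub(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(scrub("Contact jane@acme.com, SSN 123-45-6789"))
# -> Contact <EMAIL>, SSN <SSN>
# But "jane dot doe at acme" or a customer ID buried in a tool result
# sails straight through. Hence: test against your real agent traffic.
```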
How I’d actually pick
The decision tree I keep coming back to:
You’re on LangChain/LangGraph and value time over money. LangSmith. Stop overthinking it.
You want OSS, you have someone who can run a service, and cost matters at scale. Langfuse self-hosted. The 3.0 release made this the default recommendation for most platform teams I talk to.
Your bottleneck is prompt iteration, not production debugging. Braintrust, paired with Helicone or Langfuse for production tracing. Two tools, but the eval velocity is worth it.
You have an ML platform team and Arize is already in the building. Arize AX with Phoenix for dev. Don’t fight the org.
You’re a solo dev or small team and want it working today. Helicone. Revisit in six months when you’re feeling the limits.
The pattern that’s emerging in serious agent shops: gateway in front (Portkey, OpenRouter, LiteLLM), observability + evals in the middle (one of these five), and an agent framework on top (LangGraph, CrewAI, Claude Agent SDK). The three layers connect via OpenTelemetry, and the better your OTel hygiene the more replaceable each layer becomes.
If you’ve been putting this off, the cheapest experiment is dropping Helicone in front of your existing stack this afternoon and seeing what your traces look like for a week. If you already know you need real eval workflows, install Langfuse self-hosted on a $20/month box and compare it to a Braintrust trial. The wrong move is waiting until the August deadline forces the decision under pressure.