Skip to main content
Logo
Overview

AI Eval and Red-Teaming 2026: After Promptfoo Joined OpenAI

May 3, 2026
12 min read

The day OpenAI bought Promptfoo was the day a chunk of every AI platform team’s roadmap turned into a meeting. March 9, 2026. The OSS license stays open, the maintainers stay on, but the commercial roadmap for Promptfoo Cloud quietly slid into OpenAI’s Frontier track. If you’re a Claude shop running Promptfoo in CI, your eval vendor is now your model vendor’s parent company. That’s not a disaster, but it’s not nothing either.

I’ve spent the last two months helping three teams revisit their eval-and-red-team stack post-acquisition — one Anthropic-heavy, one regulated EU, one Microsoft-stack. None of them landed on the same answer, and that’s the actual story of this category in 2026. There is no single best tool. There’s a buyer archetype, a model allegiance, a regulator breathing down your neck, and a six-figure budget that has to land somewhere.

Here’s how I’d break the field down right now.

Why eval and red-team became their own line item

For most of 2024 and 2025, “we vibes-checked it in staging” was a defensible answer to a CISO. By 2026 that stopped working, for three reasons that compounded fast.

Agentic systems started taking real actions — pulling money, drafting filings, sending emails — and the failure mode stopped being “wrong answer” and started being “wrong action with downstream cost.” The OWASP LLM Top 10 v2 expanded specifically to cover agent tool-misuse and goal-hijack scenarios that didn’t even have names eighteen months ago. And the EU AI Act high-risk-system rules land August 2, 2026 with Annex IV technical-documentation requirements that auditors will actually check.

That last one is what turned this from an engineering-team decision into a procurement decision. If your AI agent touches anything in the high-risk-system list — credit, employment, education, critical infrastructure, justice — you need an audit trail your conformity-assessment auditor will sign off on. Spreadsheets of prompts you ran by hand don’t qualify.

So the question stopped being “should we eval our LLM app” and became “which vendor’s evidence will my auditor accept, will survive my model swap, and won’t get bought by my model vendor.”

The seven that actually matter

There are dozens of tools in this space if you count every observability platform that bolted on an evals tab. Realistically, seven get serious consideration in 2026 procurement. I’ll go in order of how often I see them on shortlists.

Promptfoo, post-acquisition

The OSS project is still the lowest-friction way to get evals into a CI pipeline. The GitHub Action works, the YAML config is genuinely clean, the assertion library covers most of what you need. None of that changed.

What changed is the unspoken implication of running it. If you’re a Claude or Gemini shop, you’re now sending your eval datasets — which often include your most carefully curated production prompts — through tooling owned by your model’s biggest competitor. The license technically prohibits training on your data, the maintainers are reputable, and OpenAI hasn’t given anyone a reason to distrust them on this. But that’s not how procurement risk works. Risk is about what could go wrong, not what has gone wrong.

My take: keep using OSS Promptfoo if you’re already on it and you’re an OpenAI shop or model-agnostic. If you’re a Claude shop with anything sensitive in your eval set, move to Patronus or Giskard before your CISO asks you to. Don’t migrate to Promptfoo Cloud unless you were already going to be an OpenAI Frontier customer anyway.

Patronus AI

Patronus has quietly become the obvious enterprise replacement for teams that don’t want OpenAI as their eval vendor. They ship Lynx for hallucination detection, HaluBench and FinanceBench and EnterpriseBench as published benchmarks you can compare against, and a managed eval service that’s been winning Anthropic-heavy customers all of Q2 2026.

Pricing is seat-based plus usage tiers, typically landing in the $50k–$150k/year range for a serious team. That’s expensive next to OSS Promptfoo, but if you compare it to the cost of one engineer maintaining your eval infra full-time, it’s not even close.

The thing I like about Patronus is that the founders came from Meta’s responsible-AI org and the product feels like it. Detectors are scoped narrowly, false-positive rates are published, and the evaluator models themselves get re-tested and updated. That sounds boring. It’s the opposite of boring when your auditor asks how you know your eval is still calibrated.

Lakera, now Check Point

Lakera got acquired by Check Point in 2025 and bundled into the Infinity Platform alongside CloudGuard WAF. Lakera Guard handles runtime prompt-injection and PII filtering; Lakera Red handles pre-deployment red-teaming. The combined pitch lands well with security-team-led buyers — one vendor for AI guardrails plus WAF plus zero-trust, all on a Check Point bill.

If your CISO already writes Check Point checks, this is the easy answer. If they don’t, the bundling pitch turns into a procurement headache. Lakera-only deals still happen, but the post-acquisition pricing has drifted toward enterprise-sized commitments.

The runtime guardrail is the strongest part of the product. Lakera’s prompt-injection detection has been the academic benchmark to beat for over a year, and that hasn’t changed under Check Point ownership. Where I’d push back is on the bundling story — if you don’t need WAF, you shouldn’t be paying for the integration.

Giskard

Giskard is the European answer, and that matters more in 2026 than it did a year ago. The OSS testing library covers ML and LLM workloads with a pytest plugin that drops into existing CI without much fuss. Giskard Hub is the managed tier with collaboration, dashboards, and — the part that closes deals right now — explicit support for EU AI Act conformity-assessment templates.

If you’re a European company shipping AI into the high-risk-system list, Giskard is the path of least regulator-resistance. Your auditor has probably already heard of them. Their docs map cleanly onto Annex IV evidence requirements. The GDPR posture is unambiguous because the company is in Paris and the data plane is too.

Outside Europe, Giskard is still a strong open-source pick, especially for ML-plus-LLM teams who don’t want two separate testing tools. It’s just that the EU narrative is what’s accelerating their growth in 2026.

Humanloop

Humanloop is the one I keep recommending to teams where prompt iteration is a cross-functional sport. Product managers, domain experts, support leads — anyone who needs to read a prompt and propose a change without filing a PR. The collaborative prompt-management surface is the actual product. Evals are a feature that comes along for the ride.

That framing matters because it’s also the limit. If your bottleneck is engineers needing better CI evals, Humanloop is overkill on the collaboration side and underweight on the security side. If your bottleneck is the legal-or-clinical-or-domain-expert review loop being a bottleneck, it’s the only tool in this list that even tries to solve it.

Pricing skews enterprise but they have a serviceable mid-market tier. Don’t expect them to compete with Patronus on pure detector quality, and don’t expect Patronus to compete with them on review workflows.

Robust Intelligence, now Cisco

Cisco bought Robust Intelligence in 2024 and rebuilt it as the AI Firewall inside Cisco AI Defense. If you’re already a Cisco-stack shop, this is essentially free strategic alignment — the tool sits inside the same security blast radius as the rest of your network controls, and the procurement story writes itself.

If you’re not a Cisco shop, the integration story turns into a tax. The Foundation Model evals are solid, but you can get equivalent eval coverage from Patronus or Giskard without buying a Cisco posture you didn’t want. ACVs land north of $150k for serious deployments.

I’d describe Robust Intelligence in 2026 as a category-defining tool that became a Cisco SKU. The technology is good. The fit depends entirely on whether you want Cisco running your AI security plane.

Microsoft PyRIT

PyRIT is Microsoft’s open-source Python framework for adversarial probing — orchestration, attack patterns, scoring rubrics. It’s not a managed product, it’s a toolkit. If your team has the engineering hours to write red-team campaigns programmatically, it’s the most flexible option here, and it integrates cleanly with Azure AI Studio, Defender for Cloud, and Purview.

Microsoft shops with internal AI red teams use PyRIT the way ML teams use scikit-learn — as the substrate they build their own work on. Smaller teams should not pick PyRIT thinking it’ll save them money. The licensing is free; the engineering cost is not.

PyRIT 2.0 added agentic-test support that’s worth a look if you’re trying to red-team tool-use scenarios. Most prompt-level red-team tools still treat agent-level attacks as a future feature.

Honorable mentions worth knowing

NVIDIA Garak is a CLI vulnerability scanner that pairs well with NeMo Guardrails. Confident AI’s DeepEval and DeepTeam are the lowest-friction OSS for dev teams who like pytest-style assertions and want OWASP LLM Top 10 coverage out of the box. HiddenLayer focuses on the broader ML-pipeline threat model. Mindgard runs continuous adversarial testing as a service. Galileo, Arize Phoenix, Braintrust, and LangSmith all ship eval features as part of their observability platforms — convenient if you’re already on one of them, not enough alone if eval is your primary need. Arthur, Credo AI, Holistic AI, and Vijil are governance-platform plays bought by GRC, not engineering.

The split nobody talks about: eval versus red-team

This is the architecture point that makes the most difference and that almost no vendor blog will tell you straight.

Eval is regression detection. You have a golden dataset, you run it on every change, and you watch for the metric to drop. The buyer is an engineer trying not to ship worse outputs. The cadence is per-PR.

Red-team is adversarial probing. You have a threat model, you generate attacks, and you watch for vulnerabilities. The buyer is a security lead trying not to ship exploitable systems. The cadence is per-release plus continuous.

These are different products with different buyers and different success metrics. Tools that try to be both — most of them — usually do one well and the other adequately. Promptfoo is eval-first with red-team bolted on. Lakera Red is red-team-first. Patronus straddles both with separate product lines. Giskard does both with a stronger eval pedigree. PyRIT is red-team-only.

If you’re buying one tool to do both jobs, ask which of the two roles it was actually built for. The answer is almost always one or the other.

Pricing in plain numbers

Roughly, as of April 2026 — get the actual quote because procurement-led discounts on enterprise deals are real. OSS Promptfoo, Giskard, PyRIT, Garak, and DeepEval cost zero in licensing and your engineering time in maintenance. Patronus and Humanloop run $50k–$150k/year for a typical deployment. Lakera Guard prices on token throughput. Robust Intelligence ACVs start around $150k and climb. Vendor-led structured assessments — the kind your auditor wants annually — are $40k–$150k a pop. A continuous-testing managed platform usually lands $50k–$250k/year all-in.

If those numbers shock you, the alternative is a senior engineer spending half their time on eval infra. Run the math.

Agent red-teaming is the actual hard problem

Prompt-level red-teaming — jailbreaks, injections, content filtering — is mostly a solved problem in the sense that every named tool covers the basics. Agent-level red-teaming is where the field is still figuring itself out.

Tool misuse: did the agent call the wrong tool, or call the right tool with manipulated arguments? Goal hijack: did a malicious input redirect the agent’s planning loop? Plugin-chain exfiltration: did the model leak data through the third tool in a chain that nobody audited? Compounding errors: did a small misstep in step two cascade into a serious failure by step seven?

Almost no eval tool has good answers for these yet. PyRIT 2.0 has primitives. Lakera Red has scenarios. Patronus has agent-action evaluators. Giskard added agent-test support in Q1. The category is moving fast and you should expect whichever tool you pick to look different in twelve months.

This is also where the Promptfoo acquisition matters most. Agent-level red-teaming is exactly where OpenAI’s internal frontier work is strongest, and exactly where third-party tooling has been weakest. The deal makes more sense from that angle than from any “we wanted YAML eval configs” angle.

A decision tree I actually use

Pre-PMF startup with no money: DeepEval or OSS Promptfoo, full stop, you can revisit later.

Series A SaaS shipping fast: OSS Promptfoo if you’re OpenAI-aligned, Humanloop if your iteration is cross-functional.

Claude shop wary of the OpenAI ownership question: Patronus or Giskard.

Security-team-led enterprise on the Check Point stack: Lakera plus Check Point bundle.

Security-team-led enterprise on the Cisco stack: Robust Intelligence plus AI Defense.

Regulated industry, EU AI Act high-risk system: Giskard plus Patronus plus a governance platform like Credo. Three tools, three buyers, one audit trail.

Microsoft-stack shop: PyRIT for red-team, Azure AI Studio evals for regression, Defender for Cloud for runtime.

Heavy ML pipelines plus LLM features: HiddenLayer or Giskard.

Where this fits in the rest of the agent-infra stack

The reference architecture I’ve been writing toward across the last few weeks has six layers: gateway, observability, governance, vector retrieval, memory, and now eval plus red-team. Each of those is an independent buying decision with its own vendor archetype. None of the vendors trying to own all six are doing more than two well. Pick the eval-and-red-team layer based on the model allegiance and regulator pressure that’s specific to you, not based on which suite your gateway vendor happens to also sell.

If you do nothing else this quarter: build a small golden eval dataset by hand from your real production traffic, get it running in CI on every prompt change, and run a single OWASP LLM Top 10 baseline scan against your agent. That’s the floor. Whichever vendor you pick later is a question of how much further past the floor you can afford to go.