AI Agent Benchmarks Are Broken

Benchmarks are foundational to evaluating the strengths and limitations of AI systems, guiding both research and industry development. As AI agents move from research demos to mission-critical applications, researchers and practitioners are building benchmarks to evaluate their capabilities and limitations. These AI agent benchmarks are significantly more complex than traditional AI benchmarks in task formulation (e.g., often requiring a simulator of realistic scenarios) and evaluation (e.g., no gold label), requiring greater effort to ensure their reliability.

Unfortunately, many current AI agent benchmarks are far from reliable. Consider WebArena, a benchmark used by OpenAI and others to evaluate AI agents on interactions with websites. In a task to calculate the duration of a route, an agent answered “45 + 8 minutes” and was marked correct by WebArena, although the correct answer is “63 minutes.” Moreover, among 10 popular AI agent benchmarks (e.g., SWE-bench, OSWorld, KernelBench, etc.), we found severe issues in 8 of them, causing in some cases up to 100% misestimation of agents’ capabilities.

These numbers make one thing clear: to understand an agent’s true abilities, we must build AI agent benchmarks in a more rigorous way.

How do we build AI agent benchmarks we can trust? In our recent work, we break down the failure modes in current AI agent benchmarks and introduce a checklist that minimizes the gamability of AI agent benchmarks and ensures they measure what they claim to measure. In future posts, we will provide recommendations for creating AI agent benchmarks we can trust and deep dives on specific benchmarks!

How do Current AI Agent Benchmarks Fail?

Operational and conceptual processes of AI agent evaluation. Task and outcome validity are essential to ensure that benchmark results truly reflect agents’ capabilities.

In AI agent benchmarks, agents are asked to complete tasks end-to-end, such as fixing a code issue in a large repository or creating a travel plan.

This ambitious scope creates two challenges that traditional AI benchmarks rarely face:

Fragile simulators: Tasks often run inside simulated/containerized websites, computers, or databases. If these mini-worlds are buggy or outdated, an agent can simply find a shortcut to pass or find the task impossible. No easy “gold” answer: Task solutions may be code, API calls, or paragraph-long plans, which don’t fit a fixed answer key.

Given these challenges, we propose two validity criteria that are particularly important for AI agent benchmarks:

... continue reading