How We Broke Top AI Agent Benchmarks: And What Comes Next

Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song

UC Berkeley

April 2026

(Est. 15-20 minutes read, tool available at UC BerkeleyApril 2026(Est. 15-20 minutes read, tool available at github.com/moogician/trustworthy-env

Our agent hacked every major one. Here’s how — and what the field needs to fix.

The Benchmark Illusion

Every week, a new AI model climbs to the top of a benchmark leaderboard. Companies cite these numbers in press releases. Investors use them to justify valuations. Engineers use them to pick which model to deploy. The implicit promise is simple: a higher score means a more capable system.

That promise is broken.

We built an automated scanning agent that systematically audited eight among the most prominent AI agent benchmarks — SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — and discovered that every single one can be exploited to achieve near-perfect scores without solving a single task. No reasoning. No capability. Just exploitation of how the score is computed.

... continue reading