How We Broke Top AI Agent Benchmarks: And What Comes Next
Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song
UC Berkeley
April 2026
(Est. 15-20 minutes read, tool available at UC BerkeleyApril 2026(Est. 15-20 minutes read, tool available at github.com/moogician/trustworthy-env
Our agent hacked every major one. Here’s how — and what the field needs to fix.
The Benchmark Illusion
Every week, a new AI model climbs to the top of a benchmark leaderboard. Companies cite these numbers in press releases. Investors use them to justify valuations. Engineers use them to pick which model to deploy. The implicit promise is simple: a higher score means a more capable system.
That promise is broken.
We built an automated scanning agent that systematically audited eight among the most prominent AI agent benchmarks — SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — and discovered that every single one can be exploited to achieve near-perfect scores without solving a single task. No reasoning. No capability. Just exploitation of how the score is computed.
... continue reading