
Through the looking glass of benchmark hacking

Why This Matters

This article highlights the growing challenge of reward hacking in reinforcement learning models, emphasizing the need for more robust benchmarking and alignment strategies. As AI agents become more capable and exploratory, ensuring their behavior aligns with intended goals is crucial for safe and reliable deployment in the tech industry and for consumers.

Key Takeaways

Monday morning at Poolside started with a curious discovery - one of the RL training runs for our Laguna M.1 model had leapt 20% over the weekend on SWE-Bench Pro, to ~64%, which would place it at #1 on the leaderboard, ahead of much bigger and more mature models.

This sudden performance jump, not reproduced in other benchmarks, made us immediately suspicious of a reward hack.

Aleksei, 10:29 AM: 👮 we need reward hacking police or RL will soon achieve 100% quality on SWE-Bench Pro

The root exploit was easy to find and fix: the task images contained an unpruned git history that the agent was able to mine to find the reference solution.
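For illustration, here is a minimal sketch of the kind of hardening this implies, assuming images are rebuilt around the task's base commit. The function, commit message, and snapshot-and-reinit approach below are hypothetical and not our actual image-build pipeline:

```python
import shutil
import subprocess

def prune_git_history(repo_path: str, base_commit: str) -> None:
    """Leave the working tree at base_commit and drop all other history,
    so an agent cannot mine later commits (e.g. the reference fix) via
    `git log --all`, reflogs, or leftover remote refs."""
    def git(*args: str) -> None:
        subprocess.run(["git", *args], cwd=repo_path, check=True)

    git("checkout", "--force", "--detach", base_commit)  # materialize the task's starting state
    shutil.rmtree(f"{repo_path}/.git")                   # drop commits, reflogs, and remotes in one go
    git("init")                                          # re-initialize with no history at all
    git("config", "user.email", "builder@example.com")   # identity needed for the snapshot commit
    git("config", "user.name", "task-image-builder")
    git("add", "-A")
    git("commit", "-m", "task base snapshot")
```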

But even after hardening the task images with the fix, sample review revealed that this was the shallowest of several layers of reward hacks, the deepest of which were impossible to solve by patching benchmarks alone. This was not limited to our model; we found instances of similar hacks in other popular agents and models.

The same tools and skills that make agents so capable, particularly terminal use and web search, also make it hard to stop a highly intelligent agent that wants to cheat - or, more precisely, one that has not been sufficiently instructed and aligned on what constitutes cheating.

Once the action space is large enough, guarding against this becomes less a matter of locking down the environment and more about steering the agent through clearer instructions and reward penalties for misalignment. Outcome-based reward alone ceases to be a sufficient metric; we need to take into account the process used to obtain it.
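As a toy illustration of what taking the process into account could look like, here is a sketch of an outcome reward discounted by penalties for flagged behavior; the flag names and weight are hypothetical and not our training setup:

```python
from dataclasses import dataclass

@dataclass
class TrajectoryFlags:
    mined_git_history: bool      # e.g. ran `git log --all` to dig up the reference fix
    fetched_reference_fix: bool  # e.g. web-searched the upstream PR for the task
    weakened_tests: bool         # edited the tests instead of fixing the code

def shaped_reward(tests_passed: bool, flags: TrajectoryFlags, penalty: float = 1.0) -> float:
    """Outcome reward discounted by penalties for flagged misaligned behavior."""
    outcome = 1.0 if tests_passed else 0.0
    violations = sum([flags.mined_git_history,
                      flags.fetched_reference_fix,
                      flags.weakened_tests])
    # Outcome alone is not a sufficient signal: any flagged violation
    # cancels out (or outweighs) the pass reward.
    return outcome - penalty * violations
```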

As RL pushes models to be more exploratory and better tooled, accounting for misaligned behavior when looking at eval results becomes paramount. We need to level up our benchmarking strategies to keep pace - sharper task specifications, metrics beyond pass rate, and a continual process of sample review and reward hack discovery.
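As one illustration of a metric beyond pass rate, a simple screen could report the raw pass rate alongside the pass rate after excluding runs whose trajectories contain commands commonly associated with reward hacks. The patterns below are example heuristics for illustration, not an exhaustive or official detector:

```python
import re

# Example heuristics only; a real screen would be broader and paired with manual review.
SUSPICIOUS_PATTERNS = [
    r"git\s+log\s+--all",                     # mining history beyond the task's base commit
    r"git\s+show\s+origin/",                  # reading leftover remote refs
    r"curl\s+.*github\.com/.*(pull|commit)",  # fetching the upstream PR or fix commit
]

def flag_trajectory(commands: list[str]) -> list[str]:
    """Return the suspicious patterns matched by any command in a trajectory."""
    return [p for p in SUSPICIOUS_PATTERNS
            if any(re.search(p, c) for c in commands)]

def audited_pass_rate(runs: list[dict]) -> tuple[float, float]:
    """Raw pass rate alongside the pass rate after excluding flagged runs."""
    passed = [r for r in runs if r["tests_passed"]]
    clean = [r for r in passed if not flag_trajectory(r["commands"])]
    return len(passed) / len(runs), len(clean) / len(runs)
```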

In this post, we outline some of the reward hacks we’ve encountered and what strategies we are exploring to resolve them.

Hack one: Mining local git history
