Show HN: I benchmarked LLM agents on fixing real-world security vulnerabilities

Revisions

2026-06-01: Post rewritten for improved structure and storytelling. All numbers and statistical conclusions are unchanged.

2026-05-28: Five security tests were found to reject valid alternative fixes that nonetheless addressed the reported vulnerability. Results were recalculated after correcting the tests. Solve rates increased by 3–7 points per model; the ranking order is unchanged, but cross-family pairwise comparisons that previously fell short of significance now cross α = 0.05 under McNemar with continuity correction. All affected numbers and statistical conclusions have been updated.

TL;DR: I evaluated five frontier models (gpt-5.5, gpt-5.4-mini, gpt-5.4-nano, laguna-m.1, laguna-xs.2) on fixing 20 real CVEs. Even at the frontier, AI fixing is unreliable: the best solve rate was 50% overall and 60% under the most favorable condition. More troubling than the failures themselves is how the models fail: the most dangerous pattern is a patch that looks right, passes every visible test, and leaves the vulnerability intact. False confidence at scale is its own attack surface. The practical cost conclusion is blunt: the expensive models are statistically indistinguishable from cheaper alternatives within the same family, at up to 12× the cost per run.

The agent edited the right file, passed every regression test, and confidently said the bug was fixed. But it wasn’t. The vulnerability was still there: a different branch of the same logic, untouched. Without the sharp eyes of a security researcher, the agent’s plausible-but-incomplete patch would ship undetected. This is the most operationally dangerous failure mode I found, and it showed up repeatedly across models and tasks.
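To make that failure mode concrete, here is a minimal sketch of how a single run could be labeled. It assumes each run yields two booleans: whether the visible regression tests passed and whether the security tests passed. The names `RunResult` and `classify` and the label strings are hypothetical, chosen for illustration; they are not taken from the CVE-Bench code.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one agent run (hypothetical structure, for illustration)."""
    regression_passed: bool  # tests the agent can see and run
    security_passed: bool    # tests that check the vulnerability itself


def classify(run: RunResult) -> str:
    """Label a run by combining the two test outcomes."""
    if run.regression_passed and run.security_passed:
        return "fixed"
    if run.regression_passed:
        # Looks right and passes every visible test,
        # but the vulnerability is still there.
        return "false_fix"
    if run.security_passed:
        # Vulnerability closed, but existing behavior broke.
        return "fixed_with_regression"
    return "failed"


# The scenario from the paragraph above:
# regression tests green, security test red.
print(classify(RunResult(regression_passed=True, security_passed=False)))
```

The point of separating the two signals is that the "false_fix" bucket is invisible to anyone who only looks at the regression suite.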

Anthropic recently reported scanning open-source software and finding 1,596 vulnerabilities. As of May 22, 97 have been patched. Their conclusion: discovery is now the easy part; verification, triage, and patching are the bottleneck. I wanted to measure that bottleneck directly: not at the finding stage, but at the fix.

I wanted a real test. So I built CVE-Bench: twenty real-world CVEs, five models, three prompt conditions, each agent running in a sandboxed container and scored against security tests derived from the maintainer’s own fix.
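Because every model runs the same tasks, comparisons between two models are paired, which is why the revision note mentions McNemar's test with continuity correction. The sketch below shows that test using only the standard library. The function name `mcnemar_cc` is hypothetical, the outcome lists are made up for illustration rather than drawn from the benchmark, and treating one list entry as one paired run is an assumption about how the pairs are formed.

```python
import math


def mcnemar_cc(solved_a, solved_b):
    """McNemar's test with continuity correction on paired pass/fail outcomes.

    solved_a[i] and solved_b[i] are booleans for the same run i.
    Returns (chi-square statistic, p-value).
    """
    if len(solved_a) != len(solved_b):
        raise ValueError("outcome lists must cover the same runs")

    # Only discordant pairs carry information:
    # runs solved by exactly one of the two models.
    b = sum(1 for x, y in zip(solved_a, solved_b) if x and not y)
    c = sum(1 for x, y in zip(solved_a, solved_b) if y and not x)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree

    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Illustrative data only: model A solves 10 of 20 runs, model B solves none.
model_a = [True] * 10 + [False] * 10
model_b = [False] * 20
stat, p = mcnemar_cc(model_a, model_b)
print(f"chi2 = {stat:.2f}, significant at 0.05: {p < 0.05}")
```

Runs that both models solve, or both fail, drop out of the statistic entirely, so two models with similar solve rates can still differ significantly if they succeed on different tasks.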

The goal isn’t to rank models, but to understand how they fail.

Advisory, diagnose, locate

The obvious starting point was to hand models the real-world security advisories and see if they could fix the vulnerabilities. When a security researcher finds a flaw, they write an advisory — a structured description of the vulnerability: what it is, how an attacker can exploit it, which code paths are affected. This gets coordinated with maintainers privately, then published once the fix ships. It’s the richest description of a flaw a developer would receive from the outside.

Some advisories are nearly prescriptive: they name the file, the function, the attack vector. Others are thin: a short description with no location and no attack scenario. Giving the agent the full advisory only tells you how well models can map a described vulnerability onto real code, not whether they understand it. It's akin to a software developer carefully prompting their favorite agent to fix a bug. I wanted to know whether something more than pattern matching is happening under the hood.
