CVE-Bench: testing LLM agents on real-world vulnerability patches

Correction (2026-05-28): Five security tests in the original benchmark were found to reject valid alternative fixes that nonetheless addressed the reported vulnerability. Results were recalculated after correcting the tests. Solve rates increased by 3–7 points per model; the ranking order is unchanged, but cross-family pairwise comparisons that previously fell short of significance now cross α = 0.05 under McNemar with continuity correction. All affected numbers and statistical conclusions in this post have been updated.

TL;DR: I evaluated five frontier models (three OpenAI, two Poolside) on fixing 20 real CVEs across three prompt types: full advisory, behavioral description only, and file+function location only. No model reliably fixes real vulnerabilities: the best solve rate (gpt-5.5) is 50% overall and 60% under the most favorable condition (full advisory). All four cross-family pairwise comparisons reach statistical significance under McNemar with continuity correction (p ≤ 0.040); within-family comparisons do not. The failure modes (wrong-search drift, budget exhaustion, partial fixes) are structured and repeatable. Token cost varies by 4× for equivalent outcomes. The locate condition, i.e., fixing the code without a description of the flaw, is the sharpest instrument, and every model weakens there.
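For readers unfamiliar with the test named above, here is a minimal sketch of McNemar with continuity correction on paired pass/fail outcomes. It is an illustration under stated assumptions, not the script used for this post: the function names `discordant_counts` and `mcnemar_cc` are hypothetical, and the outcome lists are made up. The test looks only at runs where exactly one of the two models succeeded, and the p-value comes from a chi-square distribution with one degree of freedom, computed here with the standard library.

```python
import math


def discordant_counts(a_solved, b_solved):
    """Count paired runs that exactly one of the two models solved."""
    if len(a_solved) != len(b_solved):
        raise ValueError("outcome lists must be paired run-for-run")
    only_a = sum(1 for a, b in zip(a_solved, b_solved) if a and not b)
    only_b = sum(1 for a, b in zip(a_solved, b_solved) if b and not a)
    return only_a, only_b


def mcnemar_cc(only_a, only_b):
    """Return (chi-square statistic, p-value) with continuity correction."""
    n = only_a + only_b
    if n == 0:
        # The models never disagree, so there is no evidence of a difference.
        return 0.0, 1.0
    # Textbook form: (|b - c| - 1)^2 / (b + c).
    stat = (abs(only_a - only_b) - 1) ** 2 / n
    # Upper tail of chi-square with 1 degree of freedom: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up paired outcomes for illustration only (True = run passed the tests).
model_a = [True] * 12 + [False] * 8
model_b = [True] * 2 + [False] * 10 + [True] * 1 + [False] * 7

only_a, only_b = discordant_counts(model_a, model_b)
stat, p_value = mcnemar_cc(only_a, only_b)
print(f"only A: {only_a}, only B: {only_b}, chi2: {stat:.3f}, p: {p_value:.4f}")
```

Runs that both models solve, or both fail, carry no weight in the statistic. When the number of disagreements is small, the exact binomial form of the test is a common alternative to the chi-square approximation.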

In early 2026, Anthropic claimed that Mythos, one of their latest models, finds security vulnerabilities better than human experts. Yet the number of security vulnerabilities keeps rising.

I wanted to test how well models do at fixing vulnerabilities. Poolside’s Laguna models arrived this year, and I was looking for a real environment to put them through. SWE-Bench, the default benchmark, tests general coding tasks; I wanted something with sharper stakes.

So, I thought, why not create a benchmark specifically for real-world security? That’s CVE-Bench. Twenty real-world CVEs, five models, three prompt conditions. Each agent runs in a sandboxed container and is scored against the maintainer’s security tests (with some adaptations).

Hopefully, benchmarks like this one will help the community fix these issues before they can be exploited.

The anatomy of security vulnerabilities

When a security researcher finds a vulnerability, they follow responsible disclosure: they contact the maintainers privately with an advisory (a structured description of the flaw) and coordinate a fix before going public. Once the fix is released, a CVE identifier is assigned and the advisory is published so users can update vulnerable dependencies.

There is a continuing effort to catalogue vulnerabilities in open-source software. Notably, the GitHub Advisory Database (GHSA) links CVEs and advisories to repositories, maintainers, and fixed versions.
