FrontierCode

Introducing FrontierCode By Eric Lu, Ben Pan, Deniz Birlikci, Sam Lee, Ray Wang, Rohan Choudhury, Fermi Ma, TC Qin, Carlo Baronio, Silas Alberti , and more → 06.08.26

Raising the bar from correctness to quality # Today’s coding benchmarks have established that models can write correct code. But as AI-generated code becomes the dominant path to production, correctness is now table stakes. The question that we should be asking is: can models actually write good code? We’re excited to introduce FrontierCode, a benchmark that measures how well models can truly meet the standards of high-quality production codebases. What sets us apart: Would the maintainer actually merge this PR? We’re the first benchmark to measure code mergeability. Our criteria assess end-to-end code quality — correctness, test quality, scope discipline, style, and adherence to codebase standards. This employs a novel ensemble of grading techniques, including unit tests, rubrics, and new types of verifiers.

Crafted by open-source maintainers. 20+ world-class open-source developers built realistic, diverse, and challenging coding tasks from the repos they maintain, spending more than 40 hours per task. They define what “mergeable” means in their repo.

Rigorous quality control. Rubric grading is subjective, so we built an extensive QC pipeline with adversarial testing, calibration, and multi-stage review, where every task is manually reviewed by a Cognition researcher. We achieve an 81% lower false positive rate compared to SWE-Bench Pro. Our benchmark provides the strongest available signal of a model’s ability to write high-quality, maintainable code. We find that even today’s most capable models struggle on this new standard.

20+ world-class open-source maintainers 40 hours effort per task Manually reviewed by Cognition researchers Every task 81% lower false positive rate Compared to SWE-Bench Pro First-ever benchmark measuring code quality And subtle human preferences

Results # We present three nested subsets of FrontierCode at increasing difficulty: Extended, Main, and Diamond. Diamond comprises the 50 hardest tasks, Main the 100 hardest (including Diamond), and Extended the full set of 150. We report two metrics, pass rate and score: A solution passes if it clears all blocker criteria, i.e., criteria that a maintainer would consider hard stops during code review, and fails otherwise.

A solution’s score is a weighted aggregate of the rubric items. Solutions that do not pass blocking criteria receive 0. Each model is run 5 times at every available reasoning effort. For each effort, we average the metric across the 5 trials, then report each model’s score at its best performing reasoning level.

FrontierCode Diamond remains unsaturated: the best performing model, Claude Opus 4.8, achieves a score of only 13.4%. Other models score significantly lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less. However, GPT 5.5 consistently uses up to 4x fewer tokens than Opus 4.8, achieving a better cost-intelligence tradeoff. On FrontierCode Main and Extended, Opus 4.8 still maintains a clear lead, at 34.3% and 51.8%, respectively. We also observe a large gap between open-source models and the frontier. Kimi K2.6, the best-performing open-source model, achieves just 3.8% on Diamond, 16% on Main and 37% on Extended. The rest of this post will be a deep dive into why and how we built FrontierCode.

Why we built FrontierCode # The first generation of coding benchmarks, such as SWE-Bench Verified and Pro, were designed for less capable models. They fall short on many measures of realism and robustness. Fundamentally, they only test functional correctness, not quality. Moreover, these benchmarks are prone to misclassification errors. Experiments from METR have found that high-scoring models on these benchmarks often produce patches that wouldn’t be accepted by human maintainers. How do we define misclassifications? These fall under two categories: False Positives: The verifier should not reward solutions that are wrong. Test coverage may be incomplete, allowing the model to write an incorrect solution that’s still accepted.

False Negatives: The verifier should not penalize solutions that are correct. Tests can be either too specific, e.g. checking for exact error strings or function names, or unsolvable, testing for a behavior not in the instruction or in the codebase. We show through analysis of agent trajectories that FrontierCode produces 81% less misclassification errors than other leading benchmarks. This means that FrontierCode scores are the most accurate ranking currently available. Existing benchmarks also suffer from lack of diversity in several ways. While other benchmarks generated issues from single PRs via programmatic scraping, FrontierCode is hand-selected by repo maintainers from multi-PR chains and freeform requests. We also triple the number of represented languages from SWE-Bench Pro. It’s also known that existing benchmarks provide too much guidance in the form of overly specified and detailed prompts. Today’s frontier models need far less hand-holding. FrontierCode expects the agent to infer the maintainer’s intent, given the same context as a human contributor. Our prompts contain two parts. First is the task description. Second, the codebase guidelines for generic testing, lint, and style practices, just like those found in AGENTS.md. The task descriptions are humanlike and deliberately concise — a third the length of SWE-Bench Pro’s.

... continue reading