
We built a real-world benchmark for AI code review


This blog reflects a collaborative effort by Qodo’s research team to design, build, and validate the benchmark and to produce this analysis.

This blog introduces Qodo’s Code Review Benchmark 1.0, a rigorous methodology developed to objectively measure and validate the performance of AI-powered code review systems, including Qodo Git Code Review. We address critical limitations in existing benchmarks, which primarily rely on backtracking from fix commits to buggy commits and therefore focus narrowly on bug detection while neglecting essential code quality and best-practice enforcement. Furthermore, previous methods often evaluate isolated buggy commits rather than simulating a complete pull request (PR) review on genuine, merged PRs, and are constrained to a small number of PRs and issues.

Our research establishes a new standard by intentionally injecting defects into genuine, merged pull requests sourced from active, production-grade open-source repositories.

This novel approach is uniquely designed to simultaneously evaluate both code correctness (bug detection) and code quality (best-practice enforcement) within a realistic code review context and at a significantly larger scale – 100 PRs containing a total of 580 issues. In a comparative evaluation pitting the Qodo model against 7 other leading AI code review platforms, Qodo demonstrated superior performance, achieving an F1 score of 60.1% in reliably identifying this diverse set of defects. The benchmark, including the evaluated tools’ reviews, is publicly available in our benchmark GitHub organization. The following sections detail our benchmark creation methodology, experimental setup, comparative results, and key takeaways from this evaluation.
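For readers unfamiliar with the metric, the sketch below shows how an F1 score over matched review findings is typically computed from precision and recall. The matching rule and the toy counts are illustrative assumptions, not the benchmark’s actual scoring procedure or results.

```python
# Minimal sketch of scoring a tool's review against injected ground truth.
# The toy counts below are illustrative only, not the benchmark's numbers.

def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 is the harmonic mean of precision and recall over matched findings."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Toy example: a reviewer reports 10 findings, 8 of which match injected
# issues, while 4 injected issues go unreported.
print(f"F1 = {f1_score(true_positives=8, false_positives=2, false_negatives=4):.3f}")
```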

Related work

While there are many benchmarks for AI code generation and bug fixing, SWE‑Bench being the most well-known, the code review domain has historically lacked robust evaluation datasets. Greptile made an important first step by creating a benchmark based on backtracking from fix commits, measuring whether AI tools could catch historically fixed bugs. Augment also used this approach to evaluate several AI code review tools. These methods are effective at spotting real bugs, but they are limited in scale, often cover only a single bug per reviewed commit, and do not capture the size, complexity, or context of full pull requests.

Qodo takes a different approach by starting with real, merged PRs and injecting multiple issues, including both functional bugs and best-practice violations. This enables the creation of larger, more realistic benchmarks for testing AI tools in system-level code review scenarios, capturing not just correctness but also code quality and compliance. By combining larger-scale PRs with dual-focus evaluation, Qodo provides a more comprehensive and practical benchmark than prior work, reflecting the full spectrum of challenges encountered in real-world code review.

Methodology

The Qodo Code Review Benchmark is constructed through a multi-stage process of injecting complex and non-obvious defects into real-world, merged pull requests (PRs) from active open-source repositories. This controlled injection approach is fundamentally designed to simultaneously evaluate both core objectives of a successful code review: code correctness (issue detection) and code quality (best-practice enforcement). This integrated design allows us to create the first comprehensive benchmark that measures the full spectrum of AI code review performance, moving beyond traditional evaluations that focus mostly on isolated bug types.
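As a rough illustration of this construction flow (not Qodo’s actual implementation; the data shapes, helper functions, and trivial injection logic below are hypothetical placeholders), a benchmark case can be thought of as a real merged PR plus a recorded list of injected issues:

```python
# Hypothetical sketch of the benchmark-construction flow described above.
# Names, data shapes, and the placeholder "injection" steps are illustrative
# assumptions, not Qodo's actual pipeline.
from dataclasses import dataclass, field

@dataclass
class InjectedIssue:
    file: str
    line: int
    kind: str          # e.g. "functional_bug" or "best_practice_violation"
    description: str

@dataclass
class BenchmarkPR:
    repo: str
    pr_number: int
    diff: str                               # merged-PR diff with defects injected
    ground_truth: list[InjectedIssue] = field(default_factory=list)

def inject_functional_bug(diff: str) -> tuple[str, InjectedIssue]:
    """Placeholder: a real pipeline would rewrite the diff to introduce a subtle bug."""
    issue = InjectedIssue(file="src/example.py", line=42,
                          kind="functional_bug",
                          description="inverted boundary check")
    return diff, issue

def inject_violation(diff: str, practice: str) -> tuple[str, InjectedIssue]:
    """Placeholder: a real pipeline would violate a repository-specific best practice."""
    issue = InjectedIssue(file="src/example.py", line=77,
                          kind="best_practice_violation",
                          description=f"violates: {practice}")
    return diff, issue

def build_benchmark_pr(repo: str, pr_number: int, original_diff: str,
                       best_practices: list[str]) -> BenchmarkPR:
    """Turn a real merged PR into a benchmark case with known, injected issues."""
    issues: list[InjectedIssue] = []
    diff = original_diff

    # 1. Inject one or more functional bugs into the merged diff.
    diff, bug = inject_functional_bug(diff)
    issues.append(bug)

    # 2. Inject violations of repository-specific best practices.
    for practice in best_practices:
        diff, violation = inject_violation(diff, practice)
        issues.append(violation)

    # 3. The injected issues become the ground truth against which
    #    each AI reviewer's comments are scored.
    return BenchmarkPR(repo=repo, pr_number=pr_number,
                       diff=diff, ground_truth=issues)
```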

Importantly, this injection-based methodology is inherently scalable and repository-agnostic. Because the process operates on real merged PRs and extracts repository-specific best practices before injection, benchmark generation can be applied to large volumes of PRs and to any codebase, open-source or private. This makes the framework a general mechanism for generating high-quality code review evaluation data at scale.
