GPT-5 is now available in Qodo’s platform for all free and paid users. Get started today.
At Qodo, we believe benchmarks should reflect how developers actually work. That’s why we built the PR Benchmark, designed to assess how well language models handle tasks like code review, suggesting improvements, and understanding developer intent.
Unlike many public benchmarks, the PR Benchmark is private, and its data is not publicly released. This ensures models haven’t seen it during training, making results fairer and more indicative of real-world generalization.
We recently evaluated a wide range of top-tier models, including variants of the newly released GPT-5, as well as Gemini 2.5, Claude Sonnet 4, Grok 4, and others. The results are promising across the board, and they offer a snapshot of how rapidly this space is evolving.
What Is the PR Benchmark?
Qodo’s PR Benchmark is designed to evaluate how well LLMs perform core pull request review tasks. The PR Benchmark tests model performance across a dataset of 400 real-world PRs from over 100 public repositories, covering multiple languages, frameworks and styles.
The PR Benchmark tests how well models can:
Understand and reason about code diffs, including logic, semantics, and context from surrounding code.
Identify bugs, issues, or opportunities for improvement, not just style or formatting changes.
Make precise, actionable suggestions, such as proposed code edits or concise inline comments.
Respect project-specific patterns, including naming conventions, architectural boundaries, and functional behavior.
It’s a focused benchmark meant to complement existing leaderboards, not replace them – targeting tasks that are high-signal for developer productivity and code quality.
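To make this concrete, here is a minimal sketch of how a single benchmark case and a model’s suggestion could be represented. The field names and structure below are illustrative assumptions for this post, not Qodo’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PRBenchmarkCase:
    """Hypothetical shape of one benchmark case (illustrative only)."""
    repo: str                          # public repository the PR comes from
    language: str                      # primary language of the changed files
    diff: str                          # unified diff of the pull request
    context_files: dict[str, str] = field(default_factory=dict)   # surrounding code available to the model
    project_conventions: list[str] = field(default_factory=list)  # e.g., naming or architectural rules to respect

@dataclass
class ReviewSuggestion:
    """Hypothetical output format: one actionable, inline-ready suggestion."""
    file: str
    line_range: tuple[int, int]        # lines in the diff the suggestion applies to
    rationale: str                     # short explanation of the issue found
    proposed_patch: str                # concise code edit limited to the new lines
```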
Benchmark Methodology
To evaluate a model, we compare its code review suggestions on 400 real-world pull requests against outputs from eleven top-performing baseline models. All suggestions are generated using Qodo Merge’s “Improve” tool.
Each model’s responses are ranked by a high-performing judge model — typically OpenAI’s o3 — which compares outputs for quality, relevance, and clarity. These rankings are then aggregated to produce a performance score.
Beyond scoring, we analyze judge feedback to understand each model’s strengths and weaknesses. This gives us a well-rounded view of how the model performs in practical code review scenarios — both quantitatively and qualitatively.
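To illustrate the scoring step, here is a minimal sketch of how per-PR judge rankings could be rolled up into a single benchmark score. The rank-to-points normalization and the function names are our own assumptions for illustration, not the exact formula behind the published numbers.

```python
from statistics import mean

def rank_to_points(rank: int, num_models: int) -> float:
    """Convert a 1-based judge rank into a 0-100 score.
    Rank 1 (best) maps to 100, last place maps to 0 (an assumed scheme)."""
    if num_models < 2:
        return 100.0
    return 100.0 * (num_models - rank) / (num_models - 1)

def benchmark_score(per_pr_rankings: list[dict[str, int]], model: str) -> float:
    """Average a model's per-PR points across the benchmark.

    per_pr_rankings: one dict per PR, mapping model name -> judge rank (1 = best).
    PRs where the model produced no output are skipped here; the real
    benchmark may handle them differently.
    """
    points = [
        rank_to_points(ranks[model], len(ranks))
        for ranks in per_pr_rankings
        if model in ranks
    ]
    return mean(points) if points else 0.0

# Toy example with three PRs and three hypothetical models:
rankings = [
    {"model-a": 1, "model-b": 2, "model-c": 3},
    {"model-a": 2, "model-b": 1, "model-c": 3},
    {"model-a": 1, "model-b": 3, "model-c": 2},
]
print(benchmark_score(rankings, "model-a"))  # ~83.3
```

Normalizing each rank before averaging keeps scores comparable even if the set of models judged on a given PR varies.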
The Results: GPT-5 Leads in Code Review Performance
GPT-5: Raising the Bar on Developer Tasks
The latest release of GPT-5 delivered the strongest performance we’ve seen so far on the PR Benchmark. Its medium-budget variant scored 72.2, with the low-budget version following closely at 70.9, an impressive result, especially given the speed-efficiency tradeoff.
But what truly stands out is the “minimal” GPT-5 variant. Designed for lightweight responsiveness, it still achieved a score of 58.5, placing it among the top performers. This reflects a growing shift toward models that balance quality with speed, and we’re excited to see that momentum grow.
Qodo Merge code suggestions with GPT-5
Deep Dive: GPT-5-Powered Code Review
GPT-5 stood out for its analytical strength and review clarity. Here are some qualitative observations from the judge feedback that illustrate what top-tier performance looks like beyond the raw score.
Strengths:
Broader bug coverage & critical focus: Often the only model to catch critical issues like security flaws or compile-breakers.
Precise, concise patches: Minimal, valid diffs that touch only new lines; no style noise, just impact.
Rule compliance & clarity: Strong adherence to review constraints with short, well-reasoned justifications.
Criticality filtering: Frequently returns nothing when there’s no real issue, avoiding unnecessary churn.
Weaknesses:
False positives: A few reviews include incorrect or harmful fixes.
Inconsistent labeling: Occasionally misclassifies the severity of findings or touches forbidden lines.
Redundancy: Some repetition or trivial suggestions that dilute review utility.
Bottom line: GPT-5 consistently delivers reviews that surface more real issues, include cleaner patches, and back them with transparent reasoning. It’s another example of how models can be optimized not just for benchmarks, but also for real developer trust.
A Word on “Minimal Models”: Speed Is the Next Frontier
As AI adoption in developer tools grows, response latency becomes just as critical as output quality. The “minimal” variant of GPT-5 reflects this shift – offering near real-time interaction while still making helpful, context-aware suggestions.
This is especially important for use cases like:
Code suggestions in IDEs
Instant PR comment generation
Fast code review workflows during CI/CD
We believe the broader ecosystem is recognizing this need, and we’re excited to see models and frameworks optimizing for developer velocity.
A Rapidly Evolving Landscape
What’s most exciting is how quickly this field is advancing. Models like Gemini 2.5, Claude 4, Grok 4, and GPT-5 are showing steady improvements, and each brings a distinct design philosophy to the table.
Some are optimized for token efficiency, others for scale, and some for low-latency interactions. While scores vary across the PR Benchmark, we see strengths and innovation in every approach.
We see these results not as competition, but as a collective reflection of how fast – and how collaboratively – this space is moving.
Every model release raises the bar for the next, and we’re genuinely inspired by that momentum.
Why Benchmarks Like This Matter
The PR Benchmark helps fill a crucial gap for developer-facing tools:
How well can a model support real-world code review workflows?
As part of a growing suite of practical evaluations, it helps tool builders and model providers alike understand what’s working – and where there’s room to grow.
We’re actively working to expand the benchmark to include:
More languages
Multi-file pull requests
Long-context reasoning
Multilingual developer environments
Try GPT-5 in Qodo today
GPT-5 is available for both free and paid users of Qodo’s agents in the IDE, Git, and CLI. For the next 30 days, you can use GPT-5 for only 1 credit. Get started today.