As coding agents became more capable, we found ourselves spending more time on review. To solve this, we built Bugbot, a code review agent that analyzes pull requests for logic bugs, performance issues, and security vulnerabilities before they reach production. By last summer, it was working so well that we decided to release it to users.
The process of building Bugbot began with qualitative assessments and gradually evolved into a more systematic approach, using a custom AI-driven metric to hill-climb on quality.
Since launch, we have run 40 major experiments that have increased Bugbot's resolution rate from 52% to over 70%, while lifting the average number of bugs flagged per run from 0.4 to 0.7. This means that the number of resolved bugs per PR has more than doubled, from roughly 0.2 to about 0.5.
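As a quick sanity check, treating resolved bugs per PR as the product of bugs flagged per run and the resolution rate (which is what the figures above imply):

```python
# Resolved bugs per PR ≈ (bugs flagged per run) × (resolution rate).
v1_flagged, v1_resolution = 0.4, 0.52
v11_flagged, v11_resolution = 0.7, 0.70

v1_resolved = v1_flagged * v1_resolution      # ≈ 0.21
v11_resolved = v11_flagged * v11_resolution   # ≈ 0.49

print(f"v1:  {v1_resolved:.2f} resolved bugs per PR")
print(f"v11: {v11_resolved:.2f} resolved bugs per PR")
print(f"improvement: {v11_resolved / v1_resolved:.1f}x")  # more than doubled
```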
We released Version 1 in July 2025 and Version 11 in January 2026. Newer versions caught more bugs without a comparable rise in false positives.
# Humble beginnings
When we first tried to build a code review agent, the models weren't capable enough for the reviews to be helpful. But as the baseline models improved, we realized we had a number of ways to increase the quality of bug reporting.
We experimented with different configurations of models, pipelines, filters, and clever context management strategies, polling engineers internally along the way. When a configuration appeared to produce fewer false positives, we adopted it.
One of the most effective quality improvements we found early on was running multiple bug-finding passes in parallel and combining their results with majority voting. Each pass received a different ordering of the diff, which nudged the model toward different lines of reasoning. When several passes independently flagged the same issue, we treated it as a stronger signal that the bug was real.
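A minimal sketch of that mechanic, where `find_bugs` is a hypothetical stand-in for a single model-driven bug-finding pass that takes a list of diff hunks and returns a set of bug identifiers:

```python
import random
from collections import Counter

def review_with_voting(diff_hunks, find_bugs, passes=8, min_votes=2, seed=0):
    """Run several bug-finding passes over shuffled orderings of the diff
    and keep only issues flagged by at least `min_votes` passes.

    `find_bugs` stands in for a model call: given a list of diff hunks,
    it returns a set of bug identifiers for that pass.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(passes):
        shuffled = diff_hunks[:]
        rng.shuffle(shuffled)  # a different ordering nudges each pass
        for bug in find_bugs(shuffled):
            votes[bug] += 1
    # Agreement across independent passes is treated as a stronger signal
    # that the bug is real.
    return {bug for bug, count in votes.items() if count >= min_votes}
```

Bugs flagged in only one of the eight passes fall below the vote threshold and are discarded as likely noise.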
After weeks of internal qualitative iterations, we landed on a version of Bugbot that outperformed other code review tools on the market and gave us confidence to launch. It used this flow:
1. Run eight parallel passes with randomized diff order
2. Combine similar bugs into buckets
3. Apply majority voting to filter out bugs found during only one pass
4. Merge each bucket into a single clear description
5. Filter out unwanted categories (like compiler warnings or documentation errors)
6. Run results through a validator model to catch false positives
7. Dedupe against bugs posted from previous runs
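The flow above can be sketched end to end. Everything here is an assumption for illustration: the callables (`find_bugs`, `bucket_key`, `merge_bucket`, `validate`) stand in for model or heuristic components, and the category labels are hypothetical. Each `find_bugs` pass is assumed to return a set, so a bug appears at most once per pass.

```python
import random
from collections import defaultdict

UNWANTED_CATEGORIES = {"compiler-warning", "documentation"}  # assumed labels

def review_pipeline(diff_hunks, find_bugs, bucket_key, merge_bucket,
                    validate, previously_posted, passes=8, min_votes=2):
    """Sketch of the seven-step flow: repeated passes, bucketing, majority
    voting, merging, category filtering, validation, and dedupe."""
    rng = random.Random(0)

    # Steps 1-2: multiple passes over randomized diff orderings,
    # grouping similar findings into buckets.
    buckets = defaultdict(list)
    for _ in range(passes):
        shuffled = diff_hunks[:]
        rng.shuffle(shuffled)
        for bug in find_bugs(shuffled):
            buckets[bucket_key(bug)].append(bug)

    reports = []
    for found in buckets.values():
        # Step 3: majority voting drops bugs seen in only one pass.
        if len(found) < min_votes:
            continue
        # Step 4: merge each bucket into a single description.
        report = merge_bucket(found)
        # Step 5: filter unwanted categories.
        if report["category"] in UNWANTED_CATEGORIES:
            continue
        # Step 6: a validator pass to catch false positives.
        if not validate(report):
            continue
        # Step 7: dedupe against bugs posted from previous runs.
        if report["id"] in previously_posted:
            continue
        reports.append(report)
    return reports
```

Note that the sketch runs the passes sequentially for simplicity; the production system runs them in parallel, which changes latency but not the filtering logic.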