TREX: An AI code reviewer that runs your code

I'm Shlok, a software engineer at Greptile. We recently built a code reviewer that, in addition to reviewing pull requests, actually runs the code and shows you what went wrong.

In 1976, Michael Fagan published a paper introducing formal code inspection at IBM. Developers would print out listings, sit in a room together, and read through the code line by line.

Today we still read a diff on a screen. AI tools have made that faster, but most of them are still just reading the code. That works for a lot of bugs: the ones that announce themselves plainly in the code.

The problem is that there's a whole category of bugs that don't show up in the code at all; they exist only when the program is running. Think of the logic error that needs a specific sequence of states, the UI regression that appears only after the page loads, or the race condition that needs a real request. You can read the diff perfectly and still miss these bugs completely.
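
To make the "specific sequence of state" case concrete, here is a small hypothetical sketch (the `Cart` class and `checkout` helper are invented for illustration and are not from Greptile's codebase). Each method reads fine on its own in a diff; the wrong total only appears when the calls happen in a particular order.

```python
class Cart:
    """Hypothetical cart: every method looks reasonable when read in isolation."""

    def __init__(self):
        self.items = []
        self.discount = 0.0

    def add_item(self, price):
        self.items.append(price)

    def apply_coupon(self, percent):
        # The discount is computed once, from the items present right now.
        self.discount = sum(self.items) * percent / 100

    def total(self):
        return sum(self.items) - self.discount


def checkout(actions):
    """Replay a sequence of (method_name, argument) calls and return the total."""
    cart = Cart()
    for name, value in actions:
        getattr(cart, name)(value)
    return cart.total()


# Same operations, different order, different result.
coupon_last = checkout([("add_item", 100), ("apply_coupon", 10)])   # 90.0
coupon_first = checkout([("apply_coupon", 10), ("add_item", 100)])  # 100.0
```

Nothing in either method is wrong line by line, which is why running the sequence reveals the problem more reliably than reading it.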

Static code review has a ceiling. It can reason about what the code says. It can't tell you what it does. TREX (which stands for "Test, Run, Execute") is Greptile's response to that ceiling: an execution layer built directly into code review.

Orchestrating agents without wasting context

TREX started as a completely separate product from Greptile: a standalone agent that generated and ran tests. We hoped that bugs would surface as a result. They didn't. Generating tests wasn't the same activity as finding bugs. When the separate TREX agent wrote tests, they weren't relevant to what the user was trying to do. This created unnecessary noise, and it also missed edge cases. It sounds obvious in hindsight, but it took us longer than expected to learn this lesson.

We'd kept TREX and the Greptile reviewer separate on the assumption that this would give each agent its own context window. It also meant the two agents ran without sharing knowledge. They often overlapped, exploring the same parts of the codebase twice without either agent knowing what the other had already found, which wasted compute.

The obvious fix seemed to be combining them into one agent. We tried that and ran into a different problem: a single agent handling the full review got overloaded. Between spinning up services, taking screenshots, and running tests, there was too much context for one agent to manage cleanly.

The solution was to run TREX from within the main Greptile reviewer, building on the reviewer's context, rather than having it exist entirely as a separate product. It was the first time we were managing agents from within an agent. Unlike an independent agent, TREX doesn't start from scratch: it inherits what the Greptile reviewer agent already found, has its own context window, and is scoped to the specific problem it's been asked to investigate.
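
The arrangement described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Greptile's implementation: the `Agent` and `Finding` types, the `run_execution_subagent` function, and the idea of modeling a context window as a list of strings are all hypothetical. It shows only the shape of the idea: the sub-agent is seeded with the parent's findings and one scoped task, its noisy work stays in its own context, and the parent keeps only a short result.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    file: str
    note: str


@dataclass
class Agent:
    name: str
    # This agent's own context window, modeled as a list of entries (assumption).
    context: list[str] = field(default_factory=list)

    def remember(self, entry: str) -> None:
        self.context.append(entry)


def run_execution_subagent(findings: list[Finding], task: str) -> tuple[Agent, str]:
    """Start a sub-agent seeded with the parent's findings and one scoped task."""
    sub = Agent(name="trex")
    sub.remember(f"task: {task}")
    for f in findings:
        # Inherited knowledge: the sub-agent does not re-explore these files.
        sub.remember(f"known: {f.file}: {f.note}")
    # Placeholder for the real work (starting services, running tests, screenshots).
    # That output lands in the sub-agent's context, not the parent's.
    sub.remember("log: started service, ran checks")
    summary = f"{task}: investigated with {len(findings)} inherited finding(s)"
    return sub, summary


reviewer = Agent(name="reviewer")
reviewer.remember("read diff for cart.py")
findings = [Finding("cart.py", "discount may be stale after add_item")]

sub, summary = run_execution_subagent(findings, "verify cart total after coupon")
reviewer.remember(summary)  # the parent keeps only the short result
```

The design choice this illustrates is the trade-off from the earlier attempts: two fully independent agents duplicated exploration, a single agent drowned in execution output, and a scoped sub-agent avoids both by receiving findings on the way in and returning a summary on the way out.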
