I've been building agents that write code while I sleep. Tools like Gastown run for hours without me watching. Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.
I care about this. I don't want to push slop, and I had no real answer.
I've run Claude Code workshops for over 100 engineers in the last six months. It's the same problem everywhere, just at different scales. Teams using Claude for everyday PRs are merging 40-50 a week instead of 10, and spending far more time in code review to keep up.
As systems get more autonomous, the problem compounds. At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.
So the question I kept coming back to: what do you actually trust when you can't review everything?
The obvious answers don't work
You could hire more reviewers. But you can't hire fast enough. And making senior engineers read AI-generated code all day isn't worth it.
You could have the AI write its own tests. But when Claude writes tests for code Claude just wrote, it's checking its own work. The tests prove the code does what Claude thought you wanted, not what you actually wanted. They catch regressions, but not the original misunderstanding.
When you use the same AI for both, you've built a self-congratulation machine.
This is exactly the problem code review was supposed to solve: a second set of eyes that wasn't the original author. But one AI writing and another AI checking isn't a fresh set of eyes. They come from the same place. They'll miss the same things.