Using AI agents correctly is a process of reviewing code. If you’re good at reviewing code, you’ll be good at using tools like Claude Code, Codex, or the Copilot coding agent.
Why is that? Large language models are good at producing a lot of code, but they don’t yet have the depth of judgement of a competent software engineer. Left unsupervised, they will commit to bad design decisions and then sink a lot of time into making them work.
AI agents and bad design
Last week I built VicFlora Offline: an offline-friendly PWA that hosts some of the VicFlora data for keying out plants, so you can still use the keys if you’re in the field somewhere with bad internet reception. Codex spent a lot of effort trying to reverse-engineer the VicFlora frontend code for the dichotomous key. It was honestly pretty impressive to watch! But I figured there had to be some easier way to access the raw data, and I was right. This happens over and over again when I use AI coding agents: about once an hour I notice that the agent is doing something that looks suspicious, and when I dig deeper I’m able to set it on the right track and save hours of wasted effort.
I’m also working on an app that helps me learn things with AI - think of it as an infinite, automatically-adjusting spaced-repetition feed. When I want to do things in parallel (e.g. generating a learning plan in the background), both Codex and Claude Code really want to build full background job infrastructure, with job entities, result polling, and so on. I like background jobs, but for ordinary short-lived parallel work they are very obviously overkill. Just make a non-blocking request from the frontend! If I weren’t consistently pushing for simplicity, my codebase would be much harder to reason about.
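To make that concrete, here’s a minimal sketch of the non-blocking-request approach in TypeScript. The endpoint, payload, and types are hypothetical placeholders rather than anything from the real app; the point is that the in-flight promise already acts as the “background job”, with no job entities or polling required.

```typescript
interface LearningPlan {
  topic: string;
  steps: string[];
}

function startPlanGeneration(topic: string): void {
  // No job table, no polling: the in-flight promise is the "background job".
  fetch("/api/learning-plan", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ topic }),
  })
    .then((res) => {
      if (!res.ok) throw new Error(`Plan generation failed: ${res.status}`);
      return res.json() as Promise<LearningPlan>;
    })
    .then((plan) => {
      // Swap in whatever state update your UI framework uses.
      console.log(`Plan ready: ${plan.steps.length} steps`);
    })
    .catch((err) => console.error("Plan generation error:", err));
}

// The call site doesn't block; the user keeps using the app while the plan generates.
startPlanGeneration("plant taxonomy");
```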
Incidentally, this is why I think pure “vibe coding” hasn’t produced an explosion of useful apps. If you don’t have the technical ability to spot when the LLM is going down the wrong track, you’ll rapidly end up stuck. Trying to make a badly-designed solution work costs time, tokens, and codebase complexity. All of these things cut into the agent’s ability to actually solve the problem. Once two or three bad decisions pile up, the app is no longer tractable for the agent and the whole thing grinds to a halt.
Code review
These examples should be familiar to anyone who’s spent enough time working on an engineering team with enthusiastic juniors. Diving right into an early idea and making it work through sheer effort is a very common mistake. It’s the job of the rest of the team to rein that in. Working with AI agents is like working with enthusiastic juniors who never develop, over time, the judgement that a real human would.
This is a good opportunity to talk about what I think is the biggest mistake engineers make in code review: only thinking about the code that was written, not the code that could have been written. I’ve seen even experienced engineers give code reviews that go through the diff with a fine-toothed comb, while spending approximately zero seconds asking if this is even the right place for the code at all.
In my view, the best code review is structural. It brings in context from parts of the codebase that the diff didn’t mention. Ideally, that context makes the diff shorter and more elegant: for instance, instead of building out a new system for operation X, we can reuse a system that already exists. Instead of building a fragile scraping pipeline that pulls dichotomous key IDs from the frontend SPA code, let’s just download the dichotomous keys from this other place where they’re explicitly made available. Instead of building out an entire background job system, let’s just do our parallel work on the client, using all the existing machinery that websites have for doing two things at the same time.
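As a sketch of what that client-side machinery can look like (again with hypothetical endpoints, not the real app’s API), the browser’s built-in promise combinators are already enough to run two requests at once:

```typescript
// Two hypothetical requests, both in flight at the same time; Promise.all
// resolves once both have finished. No job entities, no polling loop.
async function loadDashboard(): Promise<void> {
  const [reviewQueue, currentPlan] = await Promise.all([
    fetch("/api/review-queue").then((res) => res.json()),
    fetch("/api/learning-plan/current").then((res) => res.json()),
  ]);

  console.log("Review queue and current plan loaded", reviewQueue, currentPlan);
}

loadDashboard().catch((err) => console.error(err));
```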