
How I write software with LLMs

I don't care for the joy of programming

Lately I’ve gotten heavily back into making stuff, and it’s mostly because of LLMs. I thought that I liked programming, but it turned out that what I like was making things, and programming was just one way to do that. Since LLMs have become good at programming, I’ve been using them to make stuff nonstop, and it’s very exciting that we’re at the beginning of yet another entirely unexplored frontier.

There’s a lot of debate about LLMs at the moment, but a few friends have asked me about my specific workflow, so I decided to write it up in detail, in the hopes that it helps them (and you) make things more easily, quickly, and with higher quality than before.

I’ve also included a real (annotated) coding session at the end. You can go there directly if you want to skip the workflow details.

The benefits

For the first time ever, around the release of Codex 5.2 (which feels like a century ago) and, more recently, Opus 4.6, I was surprised to discover that I can now write software with LLMs at a very low defect rate, probably significantly lower than if I had hand-written the code, without losing the benefit of knowing how the entire system works. Before that, code would quickly devolve into unmaintainability after two or three days of programming; now I've been working on a few projects for weeks non-stop, growing them to tens of thousands of useful lines of code, with each change as reliable as the first.

I also noticed that my engineering skills haven't become useless; they've just shifted. I no longer need to write the code myself, but it's now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.

On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.

One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results. Because of that, I’m going to drill very far down into the weeds in this article, going as far as posting actual sessions, so you can see all the details of how I develop.

Another point worth mentioning: I don't know how models will evolve, but I've noticed a trend. In the early days of LLMs (not so much with GPT-2, as that was very limited, but from davinci onwards), I had to review every line of code and make sure that it was correct. With later generations, that moved up to the level of the function: I didn't have to check each line, but I did have to check that functions were correct. Now the checking happens mostly at the level of general architecture, and there may come a time (perhaps next year) when not even that is necessary. For now, though, you still need a human with good coding skills.
