I've been using AI fairly heavily since last November and the whole thing is a funny experience . An agent will do something that, if a human did it, you'd immediately fire them. My reaction, of course, is to act as if this is great and spin up a thousand agents so they can do even more of that.
Mid-last year, I had GPT (maybe 5.0 or 5.1) try to find the source of a bug . Naturally, this code didn't have tests and git bisect wouldn't work, and it was a UI interaction bug for which I'm not even really qualified to write a test for, so I asked codex to bisect between dates X and Y to find the commit that introduced this bug. Codex immediately told me the offending commit was after this date range (which couldn't possibly be correct). On telling codex this was wrong, it then told me some commit that was obviously also not the offending commit once or twice. On telling it those were wrong, it then told me the offending commit was some plausible looking commit. When I asked it to prove or disprove its theory, it told me that it wrote a test and confirmed that the alleged commit was the breaking commit.
I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. It claimed that it didn't have permissions to do that ( which was a lie ), but it could make video of the execution of the repro before and after the commit in playwright with the appropriate test code. The video was convincing and showed the feature working properly before the commit and failing to work after the commit. Something about this didn't feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like codex had reproduced the bug, but it was an artificial browser environment that was designed to create a fake repro, not the real environment.
Like I said, because this was non-ironically such a great experience , I immediately thought to myself, "how can I get more of this?" and started using agents more and more heavily until I was using coding agents heavily mid-late last year .
Since this post covers a relatively disparate set of topics, here's a brief outline .
Testing background
LLMs are highly leveraged when it comes to testing. In terms of the amount of effort it takes, it's easier than ever to hit a particular quality bar and yet, software seems to be lower quality than ever. A decade ago, we looked at the bugs I ran into in an arbitrary week. There were quite a few bugs then and I run into more bugs now, but I don't think this has to be the case.
For one thing, after a bug has been shipped, it's easier than it's ever been to use a data-driven approach to find and fix the bug. Just for example , at work, I tried creating a pipeline that goes from support ticket (chat or email) to pull request (PR). As far as I can tell, this works ok . Since I work for a company that has a traditional workflow, all of these fixes get reviewed by a human and, so far, we've had no known false positives .
Per unit of time invested, it's also possible to do more thorough testing. Personally, I think this can be effective enough that I'm fairly comfortable trying to ship a large volume of code via "software factories" workflow because I've seen a testing-heavy no-review workflow that results in much higher quality than any review-reliant workflow I've seen or even heard of.
Like everybody, I have biases that fall out of my experiences. It just so happens that I spent the first decade of my career at a company whose test processes happen to work well in today's LLM environment. I talked about fuzzing as a default testing methodology on Mastodon, and a skeptic tried it out and immediately found some bugs:
... continue reading