
Grace Hopper's Revenge

Why This Matters

This article highlights the evolving landscape of AI-driven coding tools and their varying effectiveness across programming languages. It underscores the importance of language choice and simplicity in software development, especially as AI models continue to improve. For consumers and developers, understanding these nuances can lead to more efficient workflows and better code quality.


The world of software has lots of rules and laws. One of the most hilarious is Kernighan’s Law:

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

I’ve always understood Kernighan’s Law to be about complexity—about keeping the code you write as simple as possible to reason about.
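A hypothetical illustration of the point (the functions and names here are mine, not the article's): the "clever" version below is shorter, but when it breaks, the straightforward version is far easier to debug.

```python
# "Clever" power-of-two check: dense bit-twiddling, hard to reason about
# when it misbehaves.
def is_power_of_two_clever(n: int) -> bool:
    return n > 0 and not (n & (n - 1))

# Simple version: more lines, but each step is obvious under a debugger.
def is_power_of_two_simple(n: int) -> bool:
    if n <= 0:
        return False
    while n % 2 == 0:
        n //= 2
    return n == 1
```

Both behave the same; the difference only shows up when you have to step through one of them at 2 a.m.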

Now, with LLMs, I'm learning it has a lot to do with language design too.

I'm still seeing a decent number of people on Twitter occasionally complain that they've tried AI-driven coding workflows, the output is crap, and they can move faster by themselves. There are fewer of these people in the world of Opus 4.5 and Gemini 3 now, but they're still there. Every time I see this I want to know what they're working on and what languages and libraries they're using.

The big benchmarks for software engineers right now are SWEBench for coding and TerminalBench for computer tasks. Benchmarks are supposed to represent all coding tasks, so it’s critical to note here that SWEBench is focused on Python. TerminalBench involves more varied computer tasks, but when the agents need to write code, they write Python.

These are effectively Python benchmarks.

So what about other languages? Fortunately there's AutoCodeBench, which doesn't just test different models; it tests across 20 different programming languages. How's that look? It looks like this:

AutoCodeBench language benchmarks across models. Elixir, Kotlin, Racket, and C# at the top. PHP, JavaScript, Python, and Perl at the bottom.

Now, what we’ve been told about models is that they’re only as good as their training data. And so languages with gargantuan amounts of training data ought to fare best, right?

... continue reading