I was reading the METR article on how LLM-written code passes tests much more often than it is of mergeable quality. They look at the performance of LLMs doing programming when the success criterion is “passes all tests” and compare it to when the success criterion is “would get approved by the maintainer”. Unsurprisingly, LLM performance is much worse under the more stringent success criterion. Their 50 % success horizon moves from 50 minutes down to 8 minutes.
As part of this they have included figures such as this one:
But there’s something about it that strikes me as odd. Let’s look only at the more valuable data, the merge rates.
What line best characterises this data? METR suggested something that slopes slightly upwards. But here’s what I see:
At some point toward the end of 2024 we may have had a step up in ability, but this plot shows no evidence of any actual improvement in merge rates since early 2025.
Fisher warns us against eyeballing plots, so let’s make it more formal. We’ll use leave-one-out cross-validation and compare the linear slope suggested by METR against the step function the plot hints at.
| Model | Brier score |
|---|---|
| Gentle upward slope | 0.0129 |
| Piecewise constant | 0.0117 |
The Brier score is a form of squared error, thus lower is better. This means the step function has more predictive power (“fits better”) than the linear slope. For fun, we can also fit a function that is completely constant across the entire timespan. That happens to get the best Brier score.
| Model | Brier score |
|---|---|
| Gentle upward slope | 0.0129 |
| Piecewise constant | 0.0117 |
| Constant function | 0.0100 |
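For readers who want to try this kind of comparison themselves, here is a minimal sketch of leave-one-out cross-validation over the three model shapes. It is not the code behind the numbers above. The data points are made up, the breakpoint at 2025.0 is an arbitrary choice, and it assumes the score is the mean squared difference between the predicted and the observed merge rate for each held-out point. All function names are hypothetical.

```python
from statistics import mean

def fit_constant(xs, ys):
    """Predict the training mean everywhere."""
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def make_fit_step(breakpoint):
    """Piecewise constant: one mean before the breakpoint, another from it onward."""
    def fit(xs, ys):
        before = [y for x, y in zip(xs, ys) if x < breakpoint]
        after = [y for x, y in zip(xs, ys) if x >= breakpoint]
        overall = mean(ys)
        # Fall back to the overall mean if one side has no training points.
        low = mean(before) if before else overall
        high = mean(after) if after else overall
        return lambda x: low if x < breakpoint else high
    return fit

def loocv_score(fit, xs, ys):
    """Mean squared error of held-out predictions (lower is better)."""
    errors = []
    for i in range(len(xs)):
        # Train on every point except the i-th, then predict the i-th.
        predict = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append((predict(xs[i]) - ys[i]) ** 2)
    return mean(errors)

# Hypothetical data, NOT METR's: release time in years, merge rate in [0, 1].
xs = [2024.2, 2024.5, 2024.8, 2025.1, 2025.4, 2025.7]
ys = [0.20, 0.25, 0.22, 0.45, 0.40, 0.43]

models = [
    ("Gentle upward slope", fit_linear),
    ("Piecewise constant", make_fit_step(2025.0)),  # breakpoint is an assumption
    ("Constant function", fit_constant),
]
for name, fit in models:
    print(f"{name:20s} {loocv_score(fit, xs, ys):.4f}")
```

The scores printed for the placeholder data say nothing about METR's results. The point is only the mechanics: each model is refitted with one point held out, and it is judged on how well it predicts that point.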
Stop and think about what this means: the two models that predict constant merge rates over the latter half of the plot are more accurate than the linear growth trend. (They are also more accurate than a logistic trend, since a curve that is linear in log-odds is nearly linear in probability for this range of values.) This corroborates what we eyeballed in the plots: the merge rate has not increased in the latter half of this plot.
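The point about logistic trends can be checked numerically. The sketch below measures how far a curve that is linear in log-odds strays from the straight line between the same two endpoints. The ranges used (0.3 to 0.6, and 0.02 to 0.98 for contrast) are assumptions picked for illustration, not values read off METR's plot, and the function names are hypothetical.

```python
import math

def logistic(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability to log-odds."""
    return math.log(p / (1.0 - p))

def max_gap_from_line(p_start, p_end, steps=100):
    """Largest gap between a curve that is linear in log-odds from p_start
    to p_end and the straight line joining the same two endpoints."""
    z0, z1 = logit(p_start), logit(p_end)
    worst = 0.0
    for i in range(steps + 1):
        t = i / steps
        curve = logistic(z0 + t * (z1 - z0))    # linear in log-odds
        line = p_start + t * (p_end - p_start)  # linear in probability
        worst = max(worst, abs(curve - line))
    return worst

# Hypothetical ranges: a moderate one, and a wide one for contrast.
print(max_gap_from_line(0.3, 0.6))
print(max_gap_from_line(0.02, 0.98))
```

Over a moderate range the two curves differ by less than a percentage point, so the leave-one-out comparison would barely distinguish them. Over a wide range that spans the tails they diverge clearly, which is why the remark is limited to the range of values seen here.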