
Is there a half-life for the success rates of AI agents?

In general, an agent’s ability to perform a task drops off as the task’s duration increases, so Kwa et al. use an AI agent’s performance on tasks of different lengths to estimate the task length at which the model would have a 50% success rate. They then show that this length has been doubling every 7 months as the capabilities of frontier agents improve. Task lengths are measured by how long the same tasks take humans to solve.
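
To make the estimation step concrete, here is a minimal sketch of one natural way to do it: fit a logistic curve of success probability against log task length and read off where it crosses 50%. The data, the logistic functional form, and the library choice are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: how long each task takes a skilled human (minutes),
# and whether the agent solved it (1) or failed (0).
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
agent_solved = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Fit success probability as a logistic function of log task length.
X = np.log(task_minutes).reshape(-1, 1)
fit = LogisticRegression(C=1e6, max_iter=1000).fit(X, agent_solved)  # large C => essentially unpenalised

# P(success) = 50% where coef * log(t) + intercept = 0.
h50 = np.exp(-fit.intercept_[0] / fit.coef_[0, 0])
print(f"Estimated 50% time horizon: {h50:.0f} minutes")

# The same fit gives the horizon for any other threshold, e.g. 80%:
# sigmoid(z) = 0.8  <=>  z = log(0.8 / 0.2)
h80 = np.exp((np.log(4) - fit.intercept_[0]) / fit.coef_[0, 0])
print(f"Estimated 80% time horizon: {h80:.0f} minutes")
```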

They used a 50% success rate as their chief performance threshold because it is the easiest level to estimate robustly. They are well aware that for many tasks the success rate required for useful work may be much higher, such as 80%, 99%, or 99.9999%. They do measure the 80% time horizon as well, and find a mean doubling time of 213 days, compared with 212 days for the 50% horizon. These are close to identical within their margin of error (±40 days for the 50% rate), so they conclude that the particular threshold doesn’t seem to have much effect on their headline result about the rate of improvement.

But there is quite a gap between the 50% and 80% success-rate time horizons. The best model (Claude 3.7 Sonnet) could achieve a 50% success rate on tasks of up to 59 minutes, versus only 15 minutes when an 80% success rate was required. If those results generalise to the other models, we could also see it like this: the task length for an 80% success rate is roughly a quarter of the task length for a 50% success rate. Or, in terms of improvement: what is doable with a 50% success rate now will be doable with an 80% success rate in 14 months’ time (two doubling times).
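
The arithmetic behind those last two claims is short enough to spell out; the numbers below are just the ones quoted above (59 and 15 minutes, a 7-month doubling time).

```python
import math

h50_minutes = 59     # 50%-success time horizon for Claude 3.7 Sonnet
h80_minutes = 15     # 80%-success time horizon for the same model
doubling_months = 7  # observed doubling time of the 50% horizon

ratio = h50_minutes / h80_minutes        # ~3.9, i.e. roughly 4x
doublings_needed = math.log2(ratio)      # ~2 doublings to close that gap
catch_up_months = doublings_needed * doubling_months

print(f"50% horizon is ~{ratio:.1f}x the 80% horizon")
print(f"The 80% horizon reaches today's 50% horizon in ~{catch_up_months:.0f} months")
```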

The idea of measuring improvement in AI capabilities over time via time horizons at a chosen success rate is novel and interesting. AI forecasting is often hamstrung by the lack of a good measure for the y-axis of performance over time. We can track progress within a particular benchmark, but these are often solved in a couple of years, and we lack a good measure of underlying capability that can span multiple benchmarks. METR’s measure allows comparisons between very different kinds of tasks in a common currency (time it takes a human) and shows a strikingly clear trend line — suggesting it is measuring something real.

But there are grounds for criticism too. In particular, there is room to wonder how much these results generalise outside of this kind of task suite. We know that there are some tasks humans can do very quickly that AIs can’t solve (e.g. some simple spatial reasoning or intuitive physics tasks) and others that would take humans an extremely long time but that AIs can do quickly (e.g. rote mathematics). So a simple measure of ‘time it would take a human’ cannot explain all AI capability improvements. Kwa et al. are aware of this and even list several other ways that this task suite may be non-representative of real-world performance, including:

All tasks were automatically scorable (a domain where RL works best)

No interaction with other agents

Lax resource constraints

For the purposes of this essay, we will take the data for what it is (performance on a particular task suite that may or may not generalise further) and explore underlying mechanisms that could explain it.

Explaining these results via a constant hazard rate
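
A constant hazard rate means the agent fails at a fixed rate λ per minute of human task time, so the probability of completing a task of length t is exp(-λt), and the 50% time horizon is the half-life ln 2 / λ. The sketch below assumes exactly that model and plugs in the horizon figures quoted earlier; it is an illustration of the setup, not necessarily the exact formulation used in the essay.

```python
import math

# Constant-hazard model: P(success on a task of human-length t) = exp(-lam * t).
h50_minutes = 59                   # observed 50% horizon for the best model
lam = math.log(2) / h50_minutes    # hazard rate implied by that half-life

def horizon(success_rate):
    """Task length (minutes) at which the model predicts this success rate."""
    return -math.log(success_rate) / lam

print(f"Implied hazard rate: {lam:.4f} failures per human-minute")
print(f"Predicted 80% horizon: {horizon(0.8):.0f} min (observed: 15 min)")
print(f"Predicted 99% horizon: {horizon(0.99):.1f} min")
```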
