In general, the ability to perform a task drops off as its duration increases, so they use an AI agent’s performance on tasks of different lengths to estimate the task length at which the model would have a 50% success rate. They then show that this length has been doubling every 7 months as the capabilities of frontier agents improve. Task lengths are measured by how long it takes humans to solve the same tasks.
They use a 50% success rate as their chief performance threshold because it is the easiest level to estimate robustly. They are well aware that for many tasks, the required success rate for useful work may be much higher — such as 80%, 99%, or 99.9999%. They do also measure the 80% success rate, finding a mean doubling-time estimate of 213 days, compared with 212 days for the 50% rate. These are close to identical within their margin of error (±40 days for the 50% rate), so they conclude that the particular threshold doesn’t seem to have much effect on their headline result about the rate of improvement.
But there is quite a gap between the 50% success rate time horizon and the 80% success rate time horizon. The best model (Claude 3.7 Sonnet) could achieve a 50% success rate on tasks of up to 59 minutes, versus only 15 minutes when an 80% success rate was required. If those results generalise to the other models, then we could also put it like this: the task length for an 80% success rate is about 1/4 of the task length for a 50% success rate. Or, in terms of improvement: what is doable with a 50% success rate now will be doable with an 80% success rate in 14 months’ time (= 2 doubling times).
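As a rough check on that arithmetic, here is a minimal sketch using only the 59- and 15-minute figures above and METR’s roughly 7-month doubling time:

```python
import math

t50 = 59                  # minutes: Claude 3.7 Sonnet's 50% success time horizon
t80 = 15                  # minutes: its 80% success time horizon
doubling_months = 7       # METR's estimated doubling time for the 50% horizon

doublings = math.log2(t50 / t80)          # ~1.98 halvings separate the two horizons
lag_months = doublings * doubling_months  # ~14 months of progress at the current rate

print(f"{doublings:.2f} doublings ≈ {lag_months:.0f} months")
```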
The idea of measuring improvement in AI capabilities over time via time horizons at a chosen success rate is novel and interesting. AI forecasting is often hamstrung by the lack of a good measure for the y-axis of performance over time. We can track progress within a particular benchmark, but these are often solved in a couple of years, and we lack a good measure of underlying capability that can span multiple benchmarks. METR’s measure allows comparisons between very different kinds of tasks in a common currency (time it takes a human) and shows a strikingly clear trend line — suggesting it is measuring something real.
But there are grounds for criticism too. In particular, there is room to wonder how much these results generalise beyond this kind of task suite. We know that there are some tasks humans can do very quickly that AIs can’t solve (e.g. some simple spatial reasoning or intuitive physics tasks) and others that would take humans an extremely long time but that AIs can do quickly (e.g. rote mathematics). So a simple measure of ‘time it would take a human’ cannot explain all AI capability improvements. Kwa et al. are aware of this and even list several other ways in which this task suite may be unrepresentative of real-world performance, including:
All tasks were automatically scorable (a domain where RL works best)
No interaction with other agents
Lax resource constraints
For the purposes of this essay, we will take the data for what it is (performance on a particular task suite that may or may not generalise further) and explore underlying mechanisms that could explain it.
Explaining these results via a constant hazard rate
These results call out for some explanation of what is going on. For example, exactly how does the time horizon shrink as the required success probability is increased? And what does it mean for an agent to be able to perform an 8-hour task but not a 16-hour task? Isn’t a 16-hour task just one 8-hour task after another?
Survival analysis is the field that studies how the probability of something failing increases as a function of time. It tracks the survival probability S(t) — the chance that the thing still hasn’t failed by time t. The simplest model in survival analysis is a constant hazard rate, meaning that the chance of failing in the next moment (conditional on having made it that far) is constant. A constant hazard rate leads to an exponentially declining survival curve. This behaviour is well known from phenomena like radioactive decay, where there is a constant chance of decay at any moment, leading to an exponentially declining chance of the isotope’s survival over time, which is often summarised by a half-life.
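In standard survival-analysis notation (the symbol λ is not from Kwa et al., just the usual convention), a constant hazard rate λ gives an exponential survival curve and a half-life:

$$h(t) = \lambda \;\Rightarrow\; S(t) = e^{-\lambda t}, \qquad S(t_{1/2}) = \tfrac{1}{2} \;\Rightarrow\; t_{1/2} = \frac{\ln 2}{\lambda}$$

So specifying the half-life (or equivalently λ) pins down the whole curve.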
If AI agent success rates drop off with task length in this manner, then the 50% success rate time horizon for each agent from Kwa et al. is precisely the half-life of that agent. As with the half-life of a radioisotope, this isn’t just the median lifespan: it is the median remaining lifespan starting from any point in time — something that is only possible for an exponential survival curve. Unlike for particles, this AI agent half-life would be measured not in clock time, but in how long the task takes a human to complete.
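A quick numerical illustration of that memoryless property (a sketch with an arbitrary 60-minute half-life, not a value fitted to the METR data):

```python
import math

half_life = 60                     # minutes; an arbitrary illustrative value
lam = math.log(2) / half_life      # hazard rate implied by that half-life

def survival(t):
    """Chance the agent still hasn't failed after t minutes of (human-time) work."""
    return math.exp(-lam * t)

# Memorylessness: conditional on surviving to minute 90, the chance of lasting
# another 60 minutes equals the unconditional chance of lasting 60 minutes.
print(f"{survival(90 + 60) / survival(90):.3f}")  # 0.500
print(f"{survival(60):.3f}")                      # 0.500
```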
One rationale for this constant hazard rate model for AI agents is that tasks require getting past a series of steps, each of which could end the attempt, and the longer the task, the more such steps there are. More precisely, if tasks could be broken down into a long sequence of equal-length subtasks, each with a constant (and independent) chance of failure, such that to succeed in the whole task the agent needs to succeed in all subtasks, then that would create an exponential survival curve. I.e. when Pr(Task) = Pr(Subtask$_1$ & Subtask$_2$ & … & Subtask$_N$).
But we don’t have to assume a perfect breakdown of the task into equal-length, equal-difficulty subtasks in order to get an exponential distribution (and corresponding constant hazard rate). We can also think of the constant hazard rate model as being agnostic about how the task is broken down, saying only that the chance of succeeding in a subtask of duration t always equals the chance of succeeding in every one of a set of smaller subtasks whose combined duration is also t. So on this model, it doesn’t matter what level of granularity you assess the task at — whether you see it as one 60-minute task, six 10-minute tasks, or a 20-minute task plus forty 1-minute tasks — the chance of succeeding is set by the total time it would take a human to complete it.
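Here is a small sketch of that granularity-invariance (again using a made-up 60-minute half-life rather than anything fitted to the data): however the 60 minutes of work are carved up, the per-subtask success probabilities multiply to the same overall figure.

```python
import math

lam = math.log(2) / 60             # illustrative hazard rate: a 60-minute half-life

def p_success(minutes):
    """Chance of completing a (sub)task that would take a human `minutes` minutes."""
    return math.exp(-lam * minutes)

# Three ways of carving up the same 60 minutes of work
decompositions = [
    [60],               # one 60-minute task
    [10] * 6,           # six 10-minute subtasks
    [20] + [1] * 40,    # a 20-minute subtask plus forty 1-minute subtasks
]

for subtasks in decompositions:
    overall = math.prod(p_success(m) for m in subtasks)
    print(f"{len(subtasks):2d} subtasks -> overall success {overall:.3f}")  # 0.500 every time
```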
This constant hazard rate model would predict that the time horizon for an 80% success rate is about ⅓ of the time horizon for a 50% success rate. This is because the chance of surviving three periods, each with an 80% success rate, is (0.8)$^3$ = 0.512 ≈ 50%. More precisely, the time horizon for a success probability of p would be ln(p)/ln(q) times as long as the one for success probability q. So the 80% time horizon would be ln(0.8)/ln(0.5) = 0.322 times as long as the 50% time horizon. Kwa et al. estimate the 80% time horizon for Claude 3.7 Sonnet to be 0.25 times as long, which is close to this theoretical estimate and within the margin of error given the noisiness of the results. The following chart shows how these numbers relate to the exponential survival curve (where T$_{50}$ is the 50% time horizon etc.).
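Those figures can be checked in a few lines (a sketch using only the numbers quoted above):

```python
import math

# Under a constant hazard rate, the horizon at success probability p is
# ln(p)/ln(q) times the horizon at success probability q.
predicted_ratio = math.log(0.8) / math.log(0.5)
print(f"Predicted 80%/50% horizon ratio: {predicted_ratio:.3f}")  # 0.322

# Sanity check: three consecutive periods at 80% each ≈ one period at 50%.
print(f"0.8^3 = {0.8 ** 3:.3f}")                                  # 0.512

# Observed ratio for Claude 3.7 Sonnet (15 vs 59 minutes, from Kwa et al.)
print(f"Observed ratio: {15 / 59:.3f}")                           # 0.254
```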