
Selection rather than prediction


Coding agents are getting quite good, and the question everyone asks is: which one should I use?

However, agent performance varies considerably by language, task type, and time. When you commit to a single agent, you're predicting it will be best for whatever task you throw at it.

That bet might be informed by evals, experience, or word of mouth. But the variance is high enough that you'll often be wrong.

Selection sidesteps prediction. Generate many candidate implementations, then choose the best from the pool. This converts the prediction problem into an optimization problem.

So, we think the question to ask instead is: how many agents should I use, and which ones?

This is often called "best-of-N": run N parallel attempts (here, across different models), then select the best output.
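As a rough sketch of the idea (not our actual tooling), best-of-N can be as small as a parallel map followed by an argmax. Here `agents` and `score` are placeholders: each agent is a callable that produces a candidate patch from a spec, and `score` rates a candidate, for example by running the repo's test suite against it.

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_n(task_spec, agents, score):
    """Run every agent on the same spec in parallel, then keep the
    highest-scoring candidate patch."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        candidates = list(pool.map(lambda agent: agent(task_spec), agents))
    return max(candidates, key=score)
```

The interesting part is never the argmax; it's where the candidates come from and who (or what) does the scoring.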

Agents Compete, Humans Arbitrate

We've been running this workflow for a few months now. Here's what it looks like:

We write a spec for the task and fan it out to multiple agents in parallel. Each agent works in its own isolated worktree and runs the repo's evals. A human reviewer then looks at the diffs, picks the best implementation, and applies that patch. The agent whose diff gets applied is the winner.

This is best-of-N with a human judge.
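Here is a minimal sketch of the fan-out step, assuming hypothetical agent CLIs (`agent-a`, `agent-b`) that take a spec file; the actual commands depend on whichever agents you run. The isolation comes from standard `git worktree` branches, and the output is one diff per agent for a human to compare.

```python
import subprocess
from pathlib import Path

# Placeholder commands; substitute whatever CLI each agent actually exposes.
AGENTS = {
    "agent-a": ["agent-a", "run", "--spec", "spec.md"],
    "agent-b": ["agent-b", "run", "--spec", "spec.md"],
}

def fan_out(repo: Path, base_branch: str = "main") -> dict[str, str]:
    """Give each agent its own worktree off the same base branch,
    let it work in parallel, and collect the resulting diffs."""
    procs = {}
    for name, cmd in AGENTS.items():
        worktree = repo.parent / f"{repo.name}-{name}"
        # One isolated worktree and branch per agent.
        subprocess.run(
            ["git", "-C", str(repo), "worktree", "add",
             "-b", f"try/{name}", str(worktree), base_branch],
            check=True,
        )
        procs[name] = subprocess.Popen(cmd, cwd=worktree)

    diffs = {}
    for name, proc in procs.items():
        proc.wait()
        worktree = repo.parent / f"{repo.name}-{name}"
        diffs[name] = subprocess.run(
            ["git", "-C", str(worktree), "diff", base_branch],
            capture_output=True, text=True,
        ).stdout
    return diffs
```

The human reviewer reads the returned diffs side by side, applies the winner, and discards the rest of the worktrees.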
