Inverse Rubric Optimization: A testbed for agent science

“It is important to draw wisdom from many different places. If you take it from only one place, it becomes rigid and stale.” — Uncle Iroh

At Fulcrum Research, we study the performance and behavior of long-horizon agents. Although each task setting has its own specific structure, we believe it’s possible to find general principles of agent performance across settings, each contributing to a nascent agent science.

In this post, we explain why suitable settings for agent science are hard to find and propose inverse rubric optimization (IRO) settings, in which an agent has to optimize against the preferences of a black-box judge to which it has variable access. We observe that these tasks induce rich behavior and smooth scaling. We find that frontier models iterate effectively and improve with more judge access, but by default do not make full use of the resources provided to them. Notably, Fable 5 outperforms all other models at smaller label budgets, but does not improve at the largest budget and plateaus around the level of Opus 4.6. We open source code here.

Testbeds for agent science

Studying the behavior of agents is challenging due to the variance and cost of long-horizon tasks. Trajectory-level variance is often notoriously high because of the many non-deterministic choices made in a run, which makes it hard to estimate the impact of any given method. This variance tends to be highest in tasks with large action spaces, yet those are exactly the tasks that induce the complex behaviors we intend to study and intervene on in our experiments.

The challenge is then to find settings that require general kinds of capability and benefit from a broad range of strategies, like resource utilization, exploration, hypothesis testing, etc., while being smooth enough for research.

To remedy this, we look at toy settings that remain challenging, rich and smooth.

Inverse rubric optimization

In an IRO task, the agent being evaluated has the goal of learning the preferences of a black-box judge model, parametrized by some judging rubric. The agent submits a policy for generation (e.g. a prompt or a scaffold), which is then used to generate domain samples that are evaluated by the judge. The agent has to explore and learn the judge's preferences by studying the scores it receives and submitting new attempts.

Fig. 1: An inverse rubric optimization task. The optimizer agent iteratively submits a policy π (e.g. a generation prompt) that maps task inputs to outputs; a black-box judge scores each output against a hidden rubric, spending one label per score out of a budget B. The agent finally submits its best policy π*, which is evaluated on held-out inputs.
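To make the loop concrete, here is a minimal Python sketch of an IRO episode. It is a toy illustration under stated assumptions, not the released code: the keyword-weight rubric, the greedy optimizer, and every name in it (BlackBoxJudge, make_policy, optimize, run_iro) are hypothetical. In a real task the judge would be a model and the policy a prompt or scaffold.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

Policy = Callable[[str], str]  # maps a task input to an output


class BudgetExhausted(Exception):
    """Raised when a score is requested with no labels left."""


@dataclass
class BlackBoxJudge:
    hidden_rubric: Dict[str, float]  # assumed toy rubric: keyword -> weight
    budget: int                      # label budget B
    labels_used: int = 0

    def _rubric_score(self, output: str) -> float:
        return sum(w for kw, w in self.hidden_rubric.items() if kw in output)

    def score(self, output: str) -> float:
        """Score one output, spending exactly one label."""
        if self.labels_used >= self.budget:
            raise BudgetExhausted
        self.labels_used += 1
        return self._rubric_score(output)

    def evaluate_held_out(self, policy: Policy, inputs: List[str]) -> float:
        """Final evaluation; does not draw on the agent's label budget."""
        return sum(self._rubric_score(policy(x)) for x in inputs) / len(inputs)


def make_policy(style_words: FrozenSet[str]) -> Policy:
    """Toy stand-in for a prompt: append the chosen style words to the input."""
    suffix = " ".join(sorted(style_words))
    return lambda task_input: f"{task_input} {suffix}".strip()


def mean_score(judge: BlackBoxJudge, policy: Policy, inputs: List[str]) -> float:
    return sum(judge.score(policy(x)) for x in inputs) / len(inputs)


def optimize(judge: BlackBoxJudge, train_inputs: List[str],
             vocabulary: List[str]) -> Policy:
    """Greedy search: keep a word only if the judge's mean score improves."""
    chosen: FrozenSet[str] = frozenset()
    try:
        best = mean_score(judge, make_policy(chosen), train_inputs)
        for word in vocabulary:
            trial = chosen | {word}
            trial_score = mean_score(judge, make_policy(trial), train_inputs)
            if trial_score > best:
                chosen, best = trial, trial_score
    except BudgetExhausted:
        pass  # out of labels: submit the best policy found so far
    return make_policy(chosen)


def run_iro(budget: int) -> Tuple[float, int]:
    """Run one episode; return (held-out score, labels used)."""
    judge = BlackBoxJudge(
        hidden_rubric={"concise": 1.0, "cited": 2.0, "verbose": -1.0},
        budget=budget,
    )
    final_policy = optimize(
        judge,
        train_inputs=["q1", "q2"],
        vocabulary=["concise", "verbose", "cited", "friendly"],
    )
    return judge.evaluate_held_out(final_policy, ["h1", "h2"]), judge.labels_used
```

In this toy, run_iro(4) returns (1.0, 4) and run_iro(10) returns (3.0, 10): with only four labels the optimizer discovers one rewarded keyword before running out, while ten labels let it test the whole vocabulary. The greedy search is chosen purely for brevity; an actual agent is free to form and test richer hypotheses about the rubric.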
