
Following the Text Gradient at Scale

Why This Matters

This article highlights the limitations of current reinforcement learning algorithms that rely on scalar reward signals, which discard valuable detailed feedback. Emphasizing richer, more informative feedback could significantly improve the efficiency and effectiveness of training AI systems, especially in complex tasks like language models. This shift could lead to faster development cycles and more precise AI behaviors, benefiting both the industry and consumers.

RL Throws Away Almost Everything Evaluators Have to Say

When you get feedback on your work, it usually tells you what went wrong and how to fix it. But existing reinforcement learning (RL) algorithms throw most of that information away: they compress potentially rich feedback into a single number, a reward, then try to learn by correlating rewards with actions across hundreds or thousands of attempts. We do this because our algorithms were designed for scalar supervision, not because of a fundamental constraint in learning from experience.

To illustrate this, let’s consider a simple example. Suppose you’re judging cakes. You take a bite: you like the shavings on top, the ganache is perfectly tempered, but you want way more cherries throughout. Yet you only record: “4/5.” The baker learns nothing about the cherry distribution or what else worked well, only that this cake scored higher than a 3. If this is the only information you provide, the baker will likely have to do a lot more baking to figure out what you actually want.
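
To see the bottleneck in code, here is a minimal sketch (the names and the judge are hypothetical stand-ins): the evaluator produces both a score and a critique, but a reward-based pipeline keeps only the score.

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    score: float   # "4/5" -- the only part a scalar-reward pipeline keeps
    critique: str  # the part it throws away

def judge_cake(cake: str) -> Evaluation:
    # Stand-in for a real evaluator (a human judge or an LLM grader).
    return Evaluation(
        score=0.8,
        critique="Great shavings, well-tempered ganache, needs far more cherries.",
    )

def rl_reward(cake: str) -> float:
    # The reward function compresses the entire evaluation into one number.
    return judge_cake(cake).score
```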

More generally, an expert given a candidate solution can articulate specific failure modes, causal mechanisms, and concrete fixes. Full verbal feedback for the cake above would contain far more actionable information than the numerical score “4/5”. The baker can confidently keep the parts of the cake that worked well, rather than blindly exploring recipe variations. This is the core insight: rich feedback enables targeted improvements rather than random exploration, resulting in fewer trials to achieve a better outcome.

This striking mismatch between the information available and the information used by RL has been aptly described as sucking supervision through a straw: you run minutes of rollout and compress it all into a final reward signal broadcast across the entire trajectory. This scalar bottleneck is becoming increasingly costly in the tasks we’re deploying LLMs on: for example, research agents run 5-30 minutes per task. Each run produces rich diagnostic logs—tool calls, intermediate reasoning, error traces—all of which are collapsed into a single scalar that discards the causal signal of where and why things failed. While rich feedback requires a bit more work from the evaluator, they’ve already done the reasoning; we’re just asking them to write it down. When rollouts themselves are expensive, the marginal annotation cost is small relative to the sample-efficiency gains we can achieve with richer feedback.
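
To make the broadcast step concrete, here is a minimal sketch, assuming a plain REINFORCE-style update with no baseline or discounting (a deliberate simplification):

```python
def trajectory_loss(step_log_probs: list[float], final_reward: float) -> float:
    # Every step in the rollout -- each tool call, reasoning step, and
    # error trace -- receives identical credit: the one scalar from the end.
    return -sum(lp * final_reward for lp in step_log_probs)
```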

In this post, we survey an emerging learning paradigm that fully embraces all the feedback an environment has to offer—avoiding the scalar bottleneck of RL—and discuss our recent work called Feedback Descent (check out the paper here), which outperforms specialized RL methods in challenging optimization domains such as molecular design and prompt optimization.

From Scalar Rewards to Text-Based Optimization

A growing body of work hints at an alternative to reward-based learning: directly using rich feedback to guide model improvement. Given a textual artifact (e.g., a prompt, source code, molecule specs), we can often provide natural-language explanations of how to improve it. That explanation is already a form of supervision. Rather than compressing it into a single number, we can feed it back into the system during the update.
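
A minimal sketch of that update loop, with `llm` standing in for any text model call (a hypothetical interface, not any particular system’s API):

```python
from typing import Callable

def refine(artifact: str, llm: Callable[[str], str], steps: int = 3) -> str:
    for _ in range(steps):
        # The evaluator articulates *what* to change, not just how good it is.
        critique = llm(
            "Critique this artifact and suggest concrete fixes:\n" + artifact
        )
        # The critique acts like a gradient: it points toward an improvement.
        artifact = llm(
            "Revise the artifact to address the critique.\n"
            f"Artifact:\n{artifact}\n\nCritique:\n{critique}"
        )
    return artifact
```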

In recent work, two broad patterns have emerged around this principle:

Critique-based or “text gradient” methods. The model proposes an artifact and receives a natural-language critique of its errors or omissions. The critique explicitly suggests a direction of improvement: adjust the retrieval query, remove this redundancy, change the control flow, etc. A revised artifact is then produced by editing the original in line with the critique. This pattern appears in systems such as Self-Refine, APO, Trace, and TextGrad.

... continue reading