Does RL scale? Over the past few years, we've seen that next-token prediction scales, denoising diffusion scales, contrastive learning scales, and so on, all the way to the point where we can train models with billions of parameters on a scalable objective that can eat up as much data as we throw at it. Then, what about reinforcement learning (RL)? Does RL also scale like all the other objectives?

Apparently, it does. In 2016, RL achieved superhuman performance in Go, and soon after in Chess and other board games. Now, RL is solving complex reasoning tasks in math and coding with large language models (LLMs). This is great. However, there is one important caveat: most of the current real-world successes of RL have been achieved with on-policy RL algorithms (e.g., REINFORCE, PPO, GRPO, etc.), which always require fresh rollouts sampled from the current policy and cannot reuse previous data (note: while PPO-like methods can technically reuse data to a limited degree, I'll classify them as on-policy RL, as in OpenAI's documentation). This is not a problem in settings like board games and LLMs, where we can cheaply generate as many rollouts as we want. However, it is a significant limitation in most real-world problems. For example, in robotics, it would take many months of real-world interaction to collect the number of samples used to post-train a language model with RL, not to mention that a human must stand next to the robot 24/7 to reset it throughout the entire training run!
On-policy RL can only use fresh data collected by the current policy \(\pi\). Off-policy RL can use any data \(\mathcal{D}\).
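As a minimal illustration of this data-reuse difference, here is a generic replay buffer sketch in Python (the class and names are mine, not from any particular library): off-policy methods keep every transition and resample it many times, whereas on-policy methods must collect fresh rollouts for every update.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer: off-policy RL can fill this with data from old
    policies, demonstrations, or other robots, and reuse each transition many
    times; on-policy RL must instead discard data after each update."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        # transition = (state, action, reward, next_state)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Sampled transitions may come from arbitrarily old policies.
        return random.sample(self.buffer, batch_size)
```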
This is where off-policy RL comes to the rescue. In principle, off-policy RL algorithms can use any data, regardless of when and how it was collected. Hence, they are generally far more sample-efficient, because they can reuse the same data many times. For example, off-policy RL can train a dog robot to walk in 20 minutes from scratch in the real world. Q-learning is the most widely used off-policy RL algorithm. It minimizes the following temporal difference (TD) loss: $$\begin{aligned} \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \bigg[ \Big( Q_\theta(s, a) - \big(r + \gamma \max_{a'} Q_{\bar \theta}(s', a') \big) \Big)^2 \bigg], \end{aligned}$$ where \(\bar \theta\) denotes the parameters of the target network. Most practical (model-free) off-policy RL algorithms are based on some variant of the TD loss above.

So, to apply RL to many real-world problems, the question becomes: does Q-learning (TD learning) scale? If the answer is yes, it would have an impact at least comparable to the successes of AlphaGo and LLMs, enabling RL to solve far more diverse and complex real-world tasks very efficiently, in robotics, computer-using agents, and so on.

## Q-learning is not yet scalable

Unfortunately, my current belief is that the answer is not yet. I believe current Q-learning algorithms are not readily scalable, at least to long-horizon problems that require more than (say) 100 semantic decision steps.

Let me clarify. My definition of scalability here is the ability to solve more challenging, longer-horizon problems with more data (of sufficient coverage), compute, and time. This notion is different from the ability to solve merely a larger number of (but not necessarily harder) tasks with a single model, which many excellent prior scaling studies have shown to be possible. You can think of the former as the "depth" axis and the latter as the "width" axis. The depth axis is more important and harder to push, because it requires developing more advanced decision-making capabilities. I claim that Q-learning, in its current form, is not highly scalable along the depth axis. In other words, I believe we still need algorithmic breakthroughs to scale up Q-learning (and off-policy RL) to complex, long-horizon problems. Below, I'll explain two main reasons why I think so: one is anecdotal, and the other is based on our recent scaling study.
Both AlphaGo and DeepSeek are based on on-policy RL and do not use TD learning.
Anecdotal evidence first. As mentioned earlier, most real-world successes of RL are based on on-policy RL algorithms. AlphaGo, AlphaZero, and MuZero are based on model-based RL and Monte Carlo tree search, and do not use TD learning on board games (see page 15 of the MuZero paper). OpenAI Five achieved superhuman performance in Dota 2 with PPO (see footnote 6 of the OpenAI Five paper). RL for LLMs is currently dominated by variants of on-policy policy gradient methods, such as PPO and GRPO. Let me ask: do we know of any real-world successes of off-policy RL (1-step TD learning, in particular) on a similar scale to AlphaGo or LLMs? If you do, please let me know and I'll happily update this post.

Of course, I'm not making this claim based only on anecdotal evidence. As mentioned above, I'll show concrete experiments that empirically support this point later in this post. Also, please don't get me wrong: I'm still highly optimistic about off-policy RL and Q-learning (as an RL researcher who mainly works on off-policy RL!). I just think that we are not there yet, and the purpose of this post is to call for research on RL algorithms, rather than to discourage it!

## What's the problem?

Then, what fundamentally makes Q-learning not readily scalable to complex, long-horizon problems, unlike other objectives? Here is my answer: $$\begin{aligned} \definecolor{myblue}{RGB}{89, 139, 231} \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \bigg[ \Big( Q_\theta(s, a) - \underbrace{\big(r + \gamma \max_{a'} Q_{\bar \theta}(s', a') \big)}_{{\color{myblue}\texttt{Biased }} (\textit{i.e., } \neq Q^*(s, a))} \Big)^2 \bigg] \end{aligned}$$ Q-learning struggles to scale because the prediction targets are biased, and these biases accumulate over the horizon. This bias accumulation is a fundamental limitation unique to Q-learning (TD learning). In contrast, other scalable objectives either have unbiased prediction targets (e.g., next-token prediction, denoising diffusion, contrastive learning, etc.), or have biases that at least do not accumulate over the horizon (e.g., BYOL, DINO, etc.).
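To make the source of this bias concrete, here is a minimal sketch of the TD loss above in PyTorch (the architectures, dimensions, and names like `q_net` and `target_net` are illustrative assumptions, not the exact setup from the paper):

```python
import torch
import torch.nn as nn

gamma = 0.99
obs_dim, num_actions = 4, 2  # illustrative dimensions

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))       # Q_theta
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))  # Q_theta_bar
target_net.load_state_dict(q_net.state_dict())  # target network starts as a copy

def td_loss(s, a, r, s_next):
    # s: [B, obs_dim] float, a: [B] long, r: [B] float, s_next: [B, obs_dim] float
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q_theta(s, a)
    with torch.no_grad():
        # Bootstrapped target: r + gamma * max_a' Q_theta_bar(s', a').
        # This is the biased term: it is built from an imperfect Q estimate
        # rather than from Q*, and every backup builds on top of the previous one.
        target = r + gamma * target_net(s_next).max(dim=1).values
    return ((q_sa - target) ** 2).mean()
```

The `torch.no_grad()` block is exactly where the bias enters: the target comes from the current (imperfect) target network, not from the true \(Q^*\).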
Biases accumulate over the horizon.
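To give a rough quantitative sense of this accumulation, here is a classical error-propagation bound from approximate dynamic programming (a textbook-style argument, not a result from the paper): if every backup produces a target that is off by at most \(\varepsilon\) in the sup norm, the error of the learned Q-function can grow to roughly \(\varepsilon / (1 - \gamma)\): $$\begin{aligned} \|Q_{k+1} - Q^*\|_\infty \le \gamma \|Q_k - Q^*\|_\infty + \varepsilon \quad \Longrightarrow \quad \limsup_{k \to \infty} \|Q_k - Q^*\|_\infty \le \frac{\varepsilon}{1 - \gamma}. \end{aligned}$$ With \(\gamma = 0.999\), the amplification factor \(1/(1-\gamma)\) is already \(1000\), which hints at why very large discount factors are so hard to use in practice.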
As the problem becomes more complex and the horizon gets longer, the biases in bootstrapped targets accumulate more and more severely, to the point where we cannot easily mitigate them with more data and larger models. I believe this is the main reason why we almost never use large discount factors (\(\gamma > 0.999\)) in practice, and why it is challenging to scale up Q-learning. Note that policy gradient methods suffer much less from this issue, because GAE and similar on-policy value estimation techniques can handle longer horizons relatively easily (though at the expense of higher variance), without strict 1-step recursions.

## Empirical scaling study

In our recent paper, we empirically verified the above claim through diverse, controlled scaling studies. We wanted to see whether current off-policy RL methods can solve highly challenging tasks by simply scaling up data and compute. To do this, we first prepared highly complex, previously unsolved tasks in OGBench. Here are some videos:
Videos of the cube, puzzle, and humanoidmaze tasks.
These tasks are really difficult. To solve them, the agent must learn complex goal-reaching behaviors from unstructured, random (play-style) demonstrations. At test time, the agent must perform precise manipulation, combinatorial puzzle-solving, or long-horizon navigation over 1,000 environment steps. We then collected near-infinite data in these environments, to the degree that overfitting is virtually impossible. We also removed as many confounding factors as possible. For example, we focused on offline RL to abstract away exploration. We ensured that the datasets had sufficient coverage and that all the tasks were solvable from the given datasets. We directly provided the agent with ground-truth state observations to reduce the burden of representation learning. Hence, a "scalable" RL algorithm should really be able to solve these tasks, given sufficient data and compute. If Q-learning does not scale even in this controlled setting with near-infinite data, there is little hope that it will scale in more realistic settings, where we have limited data, noisy observations, and so on.
Standard offline RL methods struggle to scale on complex tasks, even with \(1000\times\) more data.
So, how did the existing algorithms do? The results were a bit disappointing. None of the standard, widely used offline RL algorithms (flow BC, IQL, CRL, and SAC+BC) were able to solve all of these tasks, even with 1B-sized datasets, which are \(1000 \times\) larger than typical datasets used in offline RL. More importantly, their performance often plateaued far below the optimal level. In other words, they didn't scale well on these complex, long-horizon tasks.

You might ask: are you really sure these tasks are solvable? Did you try larger models? Did you train them for longer? Did you try different hyperparameters? And so on. In the paper, we tried our best to address as many of these questions as possible with a number of ablations and controlled experiments, showing that none of these fixes worked... except for one:

## Horizon reduction makes RL scalable

Recall my earlier claim that the horizon (and the bias accumulation it causes) is the main obstacle to scaling up off-policy RL. To verify this, we tried diverse horizon reduction techniques (e.g., n-step returns, hierarchical RL, etc.) that reduce the number of biased TD backups.
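For instance, here is what an n-step TD target looks like in code (a simple PyTorch sketch with illustrative names; the paper's exact recipes, including the hierarchical variants, are more involved):

```python
import torch

def n_step_target(rewards, s_n, target_net, gamma=0.99, n=5):
    """Compute r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
                + gamma^n * max_a' Q_target(s_{t+n}, a').

    rewards: [B, n] rewards following each sampled state.
    s_n:     [B, obs_dim] states n steps into the future.
    """
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)  # [n]
    n_step_return = (rewards * discounts).sum(dim=1)           # [B]
    with torch.no_grad():
        bootstrap = target_net(s_n).max(dim=1).values          # [B]
    # Still exactly one biased, bootstrapped term per target, but it now
    # spans n environment steps, so a horizon of H needs roughly H/n
    # recursive backups instead of H.
    return n_step_return + (gamma ** n) * bootstrap
```

The trade-off is that, on offline data, the n intermediate actions come from the behavior policy rather than the current policy, so n cannot be made arbitrarily large without introducing a different kind of (off-policy) bias.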
Horizon reduction was the only technique we found that substantially improved scaling.