The upcoming GPT-3 moment for RL
Matthew Barnett, Tamay Besiroglu, Ege Erdil
Jun 20, 2025
GPT-3 showed that simply scaling up language models unlocks powerful, task-agnostic, few-shot performance, often outperforming carefully fine-tuned models. Before GPT-3, achieving state-of-the-art performance meant first pre-training models on large generic text corpora, then fine-tuning them on specific tasks.
Today’s reinforcement learning is stuck in a similar pre-GPT-3 paradigm. We first pre-train large models, and then painstakingly fine-tune them on narrow tasks in highly specialized environments. But this approach suffers from a fundamental limitation: the resulting capabilities generalize poorly, leading to brittle performance that rapidly deteriorates outside the precise contexts seen during training.
We think RL will soon have its own GPT-3 moment. Rather than fine-tuning models on a small number of environments, we expect the field will shift toward massive-scale training across thousands of diverse environments. Doing this effectively will produce RL models with strong few-shot, task-agnostic abilities capable of quickly adapting to entirely new tasks. But achieving this will require training environments at a scale and diversity that dwarf anything currently available.
How much RL will this take?
Current RL datasets are relatively small. For example, DeepSeek-R1 was trained on roughly 600k math problems, representing about six years of continuous human effort if each task takes five minutes to complete. By contrast, reconstructing GPT-3’s 300-billion-token training corpus would require on the order of tens of thousands of years of continuous writing at typical human speeds.
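As a rough check on these figures, here is a back-of-envelope calculation. The 600k problems, five minutes per problem, and 300-billion-token corpus come from the numbers above; the tokens-per-word ratio and human writing speed are illustrative assumptions, not figures from the text.

```python
# Back-of-envelope check of the figures above. The 600k problems, 5 minutes
# per problem, and 300B tokens come from the text; the writing speed and
# tokens-per-word ratio are illustrative assumptions.

HOURS_PER_YEAR = 24 * 365  # continuous effort, no breaks

# Human effort behind DeepSeek-R1's RL dataset
rl_problems = 600_000
minutes_per_problem = 5
rl_years = rl_problems * minutes_per_problem / 60 / HOURS_PER_YEAR
print(f"RL dataset: ~{rl_years:.1f} years of continuous effort")  # ~5.7 years

# Human effort behind GPT-3's 300-billion-token pretraining corpus
pretrain_tokens = 300e9
words_per_token = 0.75   # assumption: roughly 0.75 words per token
words_per_hour = 1_000   # assumption: sustained human writing speed
pretrain_years = pretrain_tokens * words_per_token / words_per_hour / HOURS_PER_YEAR
print(f"GPT-3 corpus: ~{pretrain_years:,.0f} years of continuous writing")  # ~26,000 years
```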
Incidentally, achieving RL compute expenditure comparable to current frontier-model pretraining budgets will likely require roughly 10,000 years of model-facing task-time, measured by how long humans would take to perform the same tasks. DeepSeek-R1’s RL stage used about 6e23 FLOP and drew on about six years of model-facing task-time. Assuming future training runs use a similar number of epochs and similar group sizes to DeepSeek-R1, scaling this to about 6e26 FLOP would imply roughly 6,000 years of model-facing task-time.
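A minimal sketch of that scaling argument, assuming model-facing task-time grows linearly with RL compute when epochs and group sizes are held fixed; the FLOP figures are from the text, and the linear scaling is the stated assumption.

```python
# If epochs and group sizes stay fixed, model-facing task-time scales
# roughly linearly with RL compute. FLOP figures are from the text.

r1_flop = 6e23         # DeepSeek-R1 RL stage compute
r1_task_years = 6      # model-facing task-time behind that run

target_flop = 6e26     # comparable to frontier pretraining budgets
scaled_task_years = r1_task_years * (target_flop / r1_flop)
print(f"Implied task-time: ~{scaled_task_years:,.0f} years")  # ~6,000 years
```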
It is unclear whether future RL training will involve larger or smaller group sizes or more epochs, especially as we increase the diversity of task distributions. We don’t have much data on this question, so making precise estimates of the required model-facing task-time remains difficult, though ~10k years seems likely to be the correct order of magnitude.