How to scale RL to 10^26 FLOPs
TLDR: Reinforcement learning (RL) is the next training technique for building frontier-level AI models. To make it better, we need to train on more data. The current approach of scaling many environments simultaneously is messy and complicated. Instead, I propose we find a way to do next-token prediction on the Web using RL. This way, we learn to reason from general web data, instead of just math and code. I’ve spent a good part of the past year in denial. I was in denial because when OpenAI r