Learning the world through observation
Why do we care about next-frame prediction or next-token prediction as pretraining tasks? Because they are simple objectives that let a model with very little built-in knowledge learn how the world works directly from data. Pretraining reduces uncertainty about what comes next in a sequence, whether that’s a frame or a word. As that uncertainty drops, intelligent capabilities begin to emerge.
This is easy to see with language. Given only “=”, next-token prediction is ill-posed. Given “2+3=”, the next token is nearly deterministic. Train a language model on enough of these sequences and the predictions become low-entropy. Some capabilities emerge from scale alone, but others only appear once training sequences are long enough to include the information needed to resolve uncertainty.
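To make the entropy claim concrete, here is a small sketch; the toy corpus of single-digit additions and the character-level framing are assumptions for illustration, not the setup described above:

```python
# Minimal sketch: empirical entropy of the next character, measured on a toy
# corpus of single-digit additions (an assumed stand-in for training data).
import math
from collections import Counter

corpus = [f"{a}+{b}={a + b}" for a in range(10) for b in range(10)]

def next_char_entropy(context: str) -> float:
    """Entropy (bits) of the character that follows `context` in the corpus."""
    counts = Counter()
    for s in corpus:
        i = s.find(context)
        if i != -1 and i + len(context) < len(s):
            counts[s[i + len(context)]] += 1
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

print(next_char_entropy("="))     # high: many sums are consistent with "="
print(next_char_entropy("2+3="))  # ~0 bits: the next character must be "5"
```

With “=” alone, many continuations remain plausible; with “2+3=” the distribution collapses to a single character. That collapse is the low-entropy regime described above.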
Learning not just video, but the actions that shape it
The same logic applies to world models. To predict the next observation, a world model has to infer the underlying state of the world and how that state evolves over time. In practice, the best source of this training signal is large-scale, general video. Training on it pushes the model to learn structure about physics, causality, and persistence.
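As a rough illustration of what “infer the underlying state and how it evolves” can look like mechanically, here is a minimal recurrent world-model sketch in PyTorch. The architecture, dimensions, and loss are placeholders chosen for brevity, not a description of any particular system:

```python
# Minimal sketch of a recurrent world model (placeholder architecture): each
# observation updates a latent state, and the next observation is predicted
# from that state alone.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, obs_dim: int = 64, state_dim: int = 256):
        super().__init__()
        self.encode = nn.Linear(obs_dim, state_dim)     # observation -> features
        self.update = nn.GRUCell(state_dim, state_dim)  # fold features into the state
        self.decode = nn.Linear(state_dim, obs_dim)     # state -> predicted next observation

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        # observations: (time, batch, obs_dim)
        state = observations.new_zeros(observations.size(1), self.update.hidden_size)
        predictions = []
        for obs in observations:                          # one step per frame
            state = self.update(self.encode(obs), state)  # evolve the inferred state
            predictions.append(self.decode(state))        # predict what comes next
        return torch.stack(predictions)

model = TinyWorldModel()
frames = torch.randn(16, 2, 64)                            # 16 frames, batch of 2
predicted = model(frames)
loss = nn.functional.mse_loss(predicted[:-1], frames[1:])  # next-frame prediction loss
```

The only route from one frame to the next prediction is through the latent state, so reducing the next-frame loss forces that state to carry whatever the model knows about the scene.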
Learning long horizons and hidden state
This becomes especially clear in long-horizon settings. Imagine someone starts running a bath, leaves the room for several minutes, and then comes back. While the bath is out of view, the water level continues to rise, the temperature changes, and the tub may eventually overflow. To make a sensible prediction when the person returns, the model has to maintain an internal state of the world and reason about how that state evolved while it was unobserved.
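To put rough numbers on the bath example (the fill rate, tub capacity, and length of the gap below are invented for illustration), compare a prediction that simply replays the last observed frame with one that rolls the hidden state forward through the unobserved minutes:

```python
# Toy version of the bath example with made-up numbers: the water level keeps
# evolving while unobserved, so a coherent prediction after the gap requires
# advancing the hidden state, not replaying the last observed frame.
FILL_RATE = 2.0    # litres per minute while the tap runs (assumed)
CAPACITY = 150.0   # litres before the tub overflows (assumed)

def roll_forward(level: float, minutes_unobserved: float) -> float:
    """Advance the hidden water level through the unobserved interval."""
    return min(level + FILL_RATE * minutes_unobserved, CAPACITY)

last_observed_level = 20.0
naive_prediction = last_observed_level            # "nothing changed off-screen"
stateful_prediction = roll_forward(last_observed_level, minutes_unobserved=8)
print(naive_prediction, stateful_prediction)      # 20.0 vs. 36.0 litres
```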
We believe there are two ways to get this behavior. One is to build in explicit mechanisms for memory or state tracking. The other is to train on sequences long enough that remembering and updating hidden state is required to reduce predictive uncertainty. Short sequences don’t force this: if forgetting carries no cost, long-term structure won’t be learned.
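One way to see why short sequences carry no cost for forgetting: if every training window is shorter than the gap between a cause and its delayed consequence, no window contains both, so the loss never rewards remembering the cause. The frame indices and window lengths below are made-up numbers continuing the bath example:

```python
# Sketch with made-up numbers: how many training windows contain both the
# cause (tap turned on) and its delayed consequence (tub overflows)?
TAP_ON_FRAME = 10       # assumed frame index where the tap is turned on
OVERFLOW_FRAME = 400    # assumed frame index where the tub overflows
VIDEO_LENGTH = 600      # total frames in the clip

def windows_spanning_both(window_len: int) -> int:
    """Count training windows of length `window_len` that contain both events."""
    return sum(
        start <= TAP_ON_FRAME and OVERFLOW_FRAME < start + window_len
        for start in range(VIDEO_LENGTH - window_len + 1)
    )

print(windows_spanning_both(64))   # 0  -> forgetting the tap is never penalized
print(windows_spanning_both(512))  # 11 -> the model must carry the state forward
```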
If we want world models that learn the world observation-by-observation—and remain coherent over tens of minutes or hours—we need training data and training procedures that span those horizons. We’ve already seen how this plays out in language: extending context length and improving sequence modeling unlocked capabilities that were not apparent at shorter horizons. World models are earlier on the same trajectory. As data, architectures, and training algorithms are pushed to longer temporal scales, we should expect similar step-changes in their ability to represent persistent state, causality, and long-horizon dynamics. This is incredibly exciting.
From narrow simulation to general simulation