Published June 9, 2025 · Kevin Black, Manuel Y. Galliker, Sergey Levine
Unlike chatbots or image generators, robots must operate in real time. While a robot is “thinking”, the world around it evolves according to physical laws, so delays between inputs and outputs have a tangible impact on performance. For a language model, the difference between fast and slow generation is a satisfied or annoyed user; for a vision-language-action model (VLA), it could be the difference between a robot handing you a hot coffee or spilling it in your lap. While VLAs have achieved promising results in open-world generalization, they can be slow to run. Like their cousins in language and vision, these models have billions of parameters and require heavy-duty GPUs. On edge devices like mobile robots, that adds even more latency for network communication between a centralized inference server and the robot1.

To build a real-time system with VLAs, we are going to need some form of asynchrony: that is, we must let a model think about its future actions while executing a previous one. Action chunking — where a robot outputs and executes a sequence of multiple actions for each inference call — provides a good starting point. Our VLAs all use a chunk size of 50 actions, corresponding to 1 second of real time. However, chunking alone is not enough. When we switch chunks, the new actions might not “agree” with the old ones, causing discontinuities and unsafe accelerations. Naively smoothing over these discontinuities is not guaranteed to produce valid actions, and can have disastrous consequences.
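To make the timing concrete, here is a minimal sketch of this naive asynchronous scheme. All names (`policy`, `get_obs`, `send_action`) and numbers are illustrative stand-ins, not our actual controller code:

```python
import numpy as np

CHUNK = 50   # actions per chunk: 1 second at 50 Hz
d = 15       # inference delay in controller steps (~300 ms at 50 Hz)
s = 25       # steps executed from each chunk before switching

def naive_async(policy, get_obs, send_action, n_switches=4):
    """Naive asynchronous execution: think about the next chunk while
    executing the current one, then switch with no reconciliation."""
    chunk = policy(get_obs())            # shape (CHUNK, action_dim)
    for _ in range(n_switches):
        for t in range(s):
            if t == s - d:
                # Trigger inference d steps early so the next chunk arrives
                # right as step s - 1 finishes. (In a real system, policy()
                # would run in a background thread; shown inline for clarity.)
                obs = get_obs()
            send_action(chunk[t])
        new = policy(obs)                # indexed from the trigger time,
        chunk = new[d:]                  # so new[:d] is already in the past
        # The hand-off from chunk[s - 1] (old plan) to new[d] (new plan) is
        # where naive switching fails: the new chunk was sampled without
        # seeing the old one, so the jump can be arbitrarily large.
```

The hand-off at the end of this loop is exactly where things go wrong: the new chunk was generated with no knowledge of the actions the robot just committed to.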
As a result, in π0, π0-FAST, and π0.5, we did not use a real-time strategy2. We executed actions synchronously, meaning that we would finish executing one chunk, wait for model inference, and then begin executing the next one. This way, the robot started each chunk from rest, and we circumvented the issues that arise from switching chunks while in motion. However, this introduces pauses between chunks, which are still harmful — these pauses aren't in the training data. Not to mention: they slow things down, are ugly to look at, and discourage us from scaling up the size of our models.

To solve these problems, we developed an algorithm that we call real-time chunking (RTC). It enables real-time execution without discontinuities, and it works on any diffusion- or flow-based VLA — including π0.5 — with no training-time changes. We found that RTC significantly sped up execution time for all the tasks we tested. It was also very robust to latency, even when we injected artificial delays that pushed latencies far above normal. RTC completed dynamic and precise tasks, like striking a match or plugging in an Ethernet cable, with inference delays of more than 300 milliseconds.

Thinking while moving
[Animation: Action chunk 1 and Action chunk 2]
When there is an inference delay, real-time execution requires careful handling. While a new chunk is generated (red), the previous chunk (green) continues executing. If the new chunk is substantially different — which it often is — switching to it mid-motion results in disaster.
The core challenge of real-time execution is to maintain consistency between action chunks. By the time a new chunk has been generated by the model, the previous one has already been partially executed. Without a specialized algorithm, the new chunk might be totally incompatible with the robot's current trajectory — the model might be reacting to new information, or it might just be sampling a different “strategy” from its learned distribution of behaviors. Thus, we must somehow make use of the overlapping timesteps, where we still have access to the remaining actions of the previous chunk. A good real-time algorithm should produce a new chunk that is consistent with these overlapping actions while still preserving the model's reactivity to new information and its ability to make intelligent decisions.

Our key insight is to pose real-time chunking as an inpainting problem. Let's say that our model takes 3 controller timesteps to produce an action chunk after receiving an observation. The first 3 actions of a new chunk can't be executed, since those timesteps will have already passed by the time the new chunk is available. Thus, it makes sense to “freeze” those actions to the values from the previous chunk that we know will be executed; our goal is then to fill in the remainder of the new chunk, much like inpainting a region of an image that has been removed. Depending on the chunk size and how often we run inference, there will be some number of actions beyond the first 3 that also overlap with the previous chunk. Rather than ignoring them completely and starting fresh, it makes sense to partially attend to these middle actions — encouraging the model to keep a consistent strategy while allowing it to make updates based on new information.
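One simple way to encode this scheme is a per-action weight vector over the new chunk: 1 for the frozen prefix, a soft ramp over the rest of the overlap, and 0 for the free tail. The linear ramp below is an illustrative choice of ours, not necessarily the exact weighting schedule we use:

```python
import numpy as np

def inpainting_weights(chunk_size: int, d: int, overlap: int) -> np.ndarray:
    """Per-action consistency weights: 1.0 = frozen to the previous chunk,
    0.0 = generated freely. Assumes d <= overlap <= chunk_size.

    chunk_size: actions per chunk (50 for our models)
    d:          inference delay in controller steps; these actions WILL be
                executed before the new chunk arrives, so they are frozen
    overlap:    total actions shared with the previous chunk
    """
    w = np.zeros(chunk_size)
    w[:d] = 1.0                                   # hard inpainting constraint
    # Soft "partial attention" over the rest of the overlap: decay from 1
    # toward 0 so early overlapping actions stay close to the old plan while
    # later ones are free to react to new observations.
    w[d:overlap] = np.linspace(1.0, 0.0, overlap - d + 2)[1:-1]
    return w

print(inpainting_weights(chunk_size=10, d=3, overlap=6))
# -> [1. 1. 1. 0.75 0.5 0.25 0. 0. 0. 0.]
```

Entries with weight 1 act as hard constraints, exactly like a masked-out image region in inpainting; the intermediate weights implement the “partial attention” described above.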
Luckily, diffusion and flow models happen to be really good at image inpainting, even without being trained for it. By adapting these algorithms to our setting and adding our “partial attention” idea on top, we can solve the chunk-consistency problem without any training-time changes. This means we can easily apply our method directly on top of models like π0 and π0.5, benefiting from all the research that goes into training while executing in real time!
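To show how such weights plug into generation, here is a minimal, training-free inpainting loop for a flow-based policy. It assumes a learned velocity field `velocity_fn(x, t, obs)` and the linear noise-to-data path common in flow matching; it is a simplified stand-in for RTC's actual guided-inference procedure, not the exact algorithm:

```python
import numpy as np

def generate_chunk_inpaint(velocity_fn, obs, prev_actions, w, steps=10):
    """Integrate noise -> actions while softly pinning weighted entries
    to the previous chunk's actions.

    velocity_fn:  learned flow field v(x, t, obs) -> dx/dt
    prev_actions: (chunk_size, action_dim) remaining actions of the previous
                  chunk, zero-padded past its end
    w:            (chunk_size,) weights from inpainting_weights();
                  1.0 = frozen, 0.0 = generated freely
    """
    rng = np.random.default_rng(0)
    x = rng.standard_normal(prev_actions.shape)   # t = 0: pure noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Analytic point on the noise->data path for the known actions,
        # matching the linear interpolation the flow was trained with.
        x_known = (1 - t) * rng.standard_normal(x.shape) + t * prev_actions
        # Soft projection: frozen entries track the known path, free entries
        # keep the model's sample; intermediate weights blend the two.
        x = w[:, None] * x_known + (1 - w[:, None]) * x
        x = x + dt * velocity_fn(x, t, obs)       # Euler integration step
    return x
```

The frozen prefix is discarded at execution time anyway (those timesteps have passed); its job is purely to steer the rest of the chunk toward a trajectory consistent with what the robot is already doing.

Precision and speed with high-latency models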