Segmenting Robot Video into Actionable Subtasks

TL;DR We introduce WGO‑Bench, a new benchmark for testing robotics subtask annotation performance across 100 egocentric and robot-video episodes with 743 annotated segments spanning 62 unique high-level task instructions.

We ran over 60 experiments to find the best subtask annotation pipeline: the best subtask segmentation method reaches 0.306 F1, subtask labeling reaches 61.0% accuracy, and the best end-to-end pipeline reaches 0.168 F1.

Gemini models are the clear best for this task: the best model (Gemini 3.5 Flash) outperforms the best non-Gemini model (GPT-5.5) by 24.5%.

Our best end-to-end method uses contact sheets to keep inference cheap, costing $2.64 per hour of video (batch pricing), or roughly 19x less than human annotation.

The full pipeline is open source and implemented in Refiner; see the ready-to-use subtask annotation example to run it on your own videos.

Why annotate subtasks?

Imagine walking into a kitchen you have never seen before with an instruction: "Make me goulash." If you have never cooked it, you will need to learn it. To do so, you need more than the final instruction; you need the steps, the objects, and where to find them: open the left-most shelf, take out the cutting board, place it on the counter, pick up an onion, peel it, put it on the board, chop it, and so on.

Robot learning has a similar problem. To teach robots new long-horizon tasks, we need more than weak high-level instructions. For a robotics demonstration video, the useful signal is which subtask is happening at each moment, and where one subtask ends and the next begins.

Subtasks are becoming a central learning signal in recent robotics work. Zawalski et al. (2025; https://arxiv.org/abs/2407.08693) use subtasks together with chain-of-thought reasoning between plans and actions. The recent π series (Physical Intelligence et al., 2025; https://arxiv.org/abs/2504.16054) and RT‑H (Belkhale et al., 2024; https://arxiv.org/abs/2403.01823) use semantic subtask prediction alongside low-level action learning, with both showing substantial gains from this extra supervision. Subtasks are also useful beyond direct policy training: SARM (Kim et al., 2025; https://arxiv.org/abs/2502.20630) uses them for reward modeling.

[Figure: π0.5 diagram with a pre-trained VLA and an action expert (300M). High-level prompt: "clean the bedroom"; predicted subtask (low-level command): "pick up the pillow"; output: continuous actions.]

In π0.5, the VLA first predicts a semantic subtask from the observation and overall prompt, then predicts a low-level action chunk conditioned on that subtask through the flow-matching action expert.

As robotics data collection continues to scale, we need annotation pipelines that can keep up. Paying human annotators to watch every hour of video quickly stops being feasible. Despite the promising results from subtask supervision, there is little public material on how to mine subtask annotations at scale. The closest public writeup we found is Scale's dense video captioning post (Choghari et al., 2026; https://labs.scale.com/blog/path-to-large-scale-dense-video-captioning), but it focuses only on hand/egocentric manipulation videos and starts from already separated clips. For robotics, that skips two harder problems: taking a raw episode and deciding where one subtask ends and the next begins, and testing whether the same methods transfer from egocentric video to robot-camera settings.

To fill this gap, we built a scalable pipeline in which models annotate subtasks without any human intervention. It costs $2.64 per hour of video (batch pricing), roughly 19x cheaper than human annotation. This post shares the lessons we learned from this effort, including the best end-to-end method we found for mining subtasks from both egocentric and robot videos, as well as our new benchmark for robotics subtask annotation: WGO‑Bench (What's Going On Bench). The full pipeline is open-sourced in Refiner, our robotics data processing framework. To run it on your own data, see the ready-to-use example code.

Measuring Progress: WGO‑Bench

To iterate and choose the best approach, we needed a benchmark. Training and evaluating robot policies on every candidate method would be very slow and expensive, so we built a new benchmark, WGO‑Bench, to directly measure how close VLMs can get to human annotators, who still do this work in most current industrial efforts.
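The segmentation results in this post are reported as F1 scores, but this section does not spell out how predicted boundaries are matched to human ones. As a rough illustration only, the sketch below assumes one common convention for temporal segmentation: a predicted boundary counts as correct if it falls within a tolerance window of a not-yet-matched human boundary. The function name `boundary_f1`, the greedy matching, the `tol` value, and the timestamps are all assumptions for illustration, not the WGO‑Bench definition.

```python
from typing import Sequence


def boundary_f1(pred: Sequence[float], gold: Sequence[float], tol: float = 1.0) -> float:
    """F1 over boundary timestamps (seconds) with greedy one-to-one matching.

    Assumed rule (not taken from the post): a predicted boundary is a hit if
    it lies within `tol` seconds of a gold boundary that is still unmatched.
    """
    pred_sorted = sorted(pred)
    gold_sorted = sorted(gold)
    used = [False] * len(gold_sorted)
    hits = 0
    for p in pred_sorted:
        # Pick the closest unused gold boundary that is within tolerance.
        best, best_dist = None, tol
        for i, g in enumerate(gold_sorted):
            dist = abs(p - g)
            if not used[i] and dist <= best_dist:
                best, best_dist = i, dist
        if best is not None:
            used[best] = True
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(pred_sorted)
    recall = hits / len(gold_sorted)
    return 2 * precision * recall / (precision + recall)


# Illustrative timestamps only: three human boundaries, four model boundaries.
gold = [2.8, 5.7, 8.1]
pred = [3.0, 6.9, 8.0, 9.5]
score = boundary_f1(pred, gold, tol=0.5)
print(round(score, 3))  # 0.571 (2 hits: precision 2/4, recall 2/3)
```

A tighter tolerance penalizes boundaries that are roughly right but late or early, so any reported F1 is only comparable across methods when the same matching rule is used.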
Benchmark composition

We collected and manually annotated 100 episodes from three sources: head-camera recordings from Galaxea World (Jiang et al., 2025; https://arxiv.org/abs/2509.00576), third-person camera views of station-arm manipulation from DROID (Khazatsky et al., 2025; https://arxiv.org/abs/2403.12945), and egocentric videos from HomER (Toloka, 2026; HomER v2: Home Egocentric Robotics Dataset, Hugging Face). Together they form WGO‑Bench, a diverse subtask annotation benchmark. In total, it contains 743 annotated segments across 62 unique high-level task instructions.

| Section | Type | Viewpoint | Samples | Unique tasks | Total duration | Avg. episode length | Resolution | Segments |
|---|---|---|---|---|---|---|---|---|
| HomER | Human | Egocentric | 25 | 17 | 39.2 min | 94.0 s | Mixed, mostly 1920x1080 / 848x480 | 470 |
| DROID | Robot | External robot camera | 50 | 26 | 24.9 min | 29.9 s | 320x180 | 150 |
| Galaxea | Robot | Robot head camera | 25 | 19 | 7.4 min | 17.7 s | 1280x720 | 123 |
| Total | Mixed | Mixed | 100 | 62 | 71.5 min | 42.9 s | Mixed | 743 |

WGO‑Bench sample breakdown
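As a quick sanity check on the breakdown, the short script below recomputes the totals from the per-source rows and combines them with the two cost figures quoted in this post ($2.64 per video hour at batch pricing, and "roughly 19x" cheaper than human annotation). The derived numbers (average segment length, cost of annotating the benchmark, implied human cost) are back-of-the-envelope arithmetic rather than figures reported by the authors, and the variable names are ours.

```python
# Rows copied from the WGO-Bench breakdown:
# (section, samples, unique_tasks, total_minutes, segments)
rows = [
    ("HomER", 25, 17, 39.2, 470),
    ("DROID", 50, 26, 24.9, 150),
    ("Galaxea", 25, 19, 7.4, 123),
]

samples = sum(r[1] for r in rows)
tasks = sum(r[2] for r in rows)
minutes = round(sum(r[3] for r in rows), 1)
segments = sum(r[4] for r in rows)

avg_episode_s = minutes * 60 / samples    # average episode length, seconds
avg_segment_s = minutes * 60 / segments   # average segment length, seconds

PRICE_PER_VIDEO_HOUR = 2.64  # USD, batch pricing quoted in the post
HUMAN_COST_RATIO = 19        # "roughly 19x" quoted in the post

benchmark_cost = minutes / 60 * PRICE_PER_VIDEO_HOUR
implied_human_per_hour = PRICE_PER_VIDEO_HOUR * HUMAN_COST_RATIO

print(f"{samples} episodes, {tasks} tasks, {segments} segments, {minutes} min")
print(f"avg episode {avg_episode_s:.1f} s, avg segment {avg_segment_s:.1f} s")
print(f"pipeline cost for the whole benchmark: ${benchmark_cost:.2f}")
print(f"implied human cost per video hour: ${implied_human_per_hour:.2f}")

# Per-source segment length shows how differently the sources are paced.
for name, _, _, mins, segs in rows:
    print(f"{name}: {mins * 60 / segs:.1f} s per segment")
```

Under these figures, a segment lasts a little under 6 seconds on average, running the pipeline over all 71.5 minutes would cost about $3.15, and the 19x ratio implies a human annotation cost of about $50 per hour of video.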

Manually annotating subtasks

We manually annotated WGO‑Bench demonstrations following a strict annotation protocol: segments are atomic manipulation events, boundaries follow object-state changes, and labels must be self-contained enough to train policies without relying on previous actions.

[Figure: annotation policy examples from galaxea_069, panel 1 of 3, "Atomic event": one subtask should describe one completed manipulation event. Wrong: a single "Pick, move, place" segment. Correct: separate "Pick" and "Place" segments. Subtasks should be atomic: one completed pick, one completed place, not a combined pick-move-place motion.]

Annotation protocol details

The episodes were segmented into atomic manipulation events rather than motion fragments. A subtask ends when the event is complete, not when the robot hand returns to a neutral pose. Unless there is a clear pause, the next subtask starts immediately after the previous one.

Boundaries are placed at object-manipulation changes: when an object becomes held, is released, reaches a new location, or a door or lid changes state. Camera motion, hesitation, and tiny hand adjustments are not separate subtasks.

Labels are self-contained. They do not refer to previous human or robot actions, and they describe the manipulated object and target location as precisely as possible: not "put the cup on the table", but "put the cup on the table next to the bowl." This prevents ambiguity because most robotic policies do not take past frames or actions as input.
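Some of the protocol rules above can be checked mechanically before an annotation is accepted. The sketch below is a minimal illustration, not the authors' tooling: the `Segment` record, the `check_segments` helper, the `max_gap_s` threshold, and the list of back-reference words are all hypothetical. It flags segments with non-positive duration, overlaps, gaps that should correspond to a clear pause, and labels that appear to lean on earlier actions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_s: float  # segment start, seconds from episode start
    end_s: float    # segment end, seconds from episode start
    label: str      # self-contained subtask description


# Hypothetical word list: phrases that usually point back at earlier actions.
BACK_REFERENCES = ("again", "previous", "continue", "same as before")


def check_segments(segments: list[Segment], max_gap_s: float = 0.5) -> list[str]:
    """Return human-readable protocol warnings for one annotated episode."""
    warnings = []
    for i, seg in enumerate(segments):
        if seg.end_s <= seg.start_s:
            warnings.append(f"segment {i}: end is not after start")
        lowered = seg.label.lower()
        for phrase in BACK_REFERENCES:
            # Whole-word match so "again" does not fire on "against".
            if re.search(rf"\b{re.escape(phrase)}\b", lowered):
                warnings.append(
                    f"segment {i}: label may refer to earlier actions ({phrase!r})"
                )
        if i > 0:
            gap = seg.start_s - segments[i - 1].end_s
            if gap < 0:
                warnings.append(f"segment {i}: overlaps the previous segment")
            elif gap > max_gap_s:
                warnings.append(
                    f"segment {i}: {gap:.1f}s gap; confirm there is a clear pause"
                )
    return warnings


# Illustrative episode with made-up timestamps and labels.
episode = [
    Segment(0.0, 2.8, "pick up the cup from the counter"),
    Segment(2.8, 5.7, "put the cup on the table next to the bowl"),
    Segment(6.9, 8.1, "pick up the cup again"),
]
for message in check_segments(episode):
    print(message)
```

Checks like these only catch structural slips. Whether a boundary really sits at an object-state change, or whether a label is precise enough, still needs a human or a model looking at the video.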
