We designed FDM-1, a foundation model for computer use. FDM-1 is trained on videos from a portion of our 11-million-hour screen recording dataset, which we labeled using an inverse dynamics model that we trained. Our video encoder can compress almost 2 hours of 30 FPS video into only 1M tokens. FDM-1 is the first model with the long-context training needed to become a coworker for CAD, finance, engineering, and eventually ML research, and it consistently improves with scale. It trains and infers directly on video instead of screenshots, and it can learn, unsupervised, from the entirety of the internet.
Before today, the recipe for building a computer use agent was to finetune a vision-language model (VLM) on contractor-annotated screenshots of computer use, then build reinforcement learning environments to learn each specific downstream task. Agents trained this way cannot act on more than a few seconds of context, process high-framerate video, or carry out long-horizon tasks, and they do not scale into competent agents.
Moreover, training these VLMs requires contractor-labeled annotations. These are expensive, so current computer action datasets are tiny: the largest open dataset is less than 20 hours of 30 FPS video. Meanwhile, millions of hours of film editing, coding livestreams, video game playthroughs, and more have accumulated on the internet over the past two decades. Building a general computer agent requires an internet-scale video corpus, just as building GPT-3 required an internet-scale text corpus. FDM-1 is the first model that can train at this scale.
Here are some demos of our model doing CAD, driving a car, and fuzzing a website!
Figure 1: FDM-1 extrudes faces on an n-gon to make a gear in Blender. Demo created using a forking VM.

Computer-Aided Design. FDM-1 completes continuous mouse movements to do basic CAD tasks. We create OS checkpoints at successful operations (extrude, select, etc.), which unlocks test-time compute for computer use tasks. At the end of the video, we show full model generations on a variety of tasks.

Figure 2: Using arrow keys, FDM-1 autonomously drives a car after less than 1 hour of finetuning data.

Self Driving. FDM-1 generalizes beyond computer screens to the real world! After fine-tuning on less than 1 hour of collected data, the model uses key presses to navigate turns around a block in San Francisco. We forked openpilot’s “joystick mode” to control the vehicle and built a website for remote steering via arrow keys. The website displays live video feeds alongside steering angle, brake, and acceleration data. The model executes turns and corrects back to straight-line steering around the block. Fine-tuning FDM-1 substantially outperforms initializing from scratch on our self-driving task.

Figure 3: FDM-1 is uniquely good at fuzzing. Here, it finds a bug in a mock banking app by exploring as many unique states as possible.

Automated UI Testing. FDM-1 is unusually capable at “fuzzing” GUIs: finding bugs that require deep exploration of the state tree or strange GUI interactions. Fuzzing cannot be done with random walks or random key presses because they do not sufficiently emulate the actions a human would take. We demonstrate this in a toy environment where we use our forking VM infrastructure to explore as many unique states as possible in a banking app, forking whenever a meaningfully new state has been reached.
The model finds a bug where the “Submit Wire Transfer” button is clickable right after a wire transfer has already been completed, which allows the account’s balance to go negative.
To train on all this video, you need to label it with actions like key presses and mouse movements. Prior literature has explored automatic labeling: in Behavior Cloning from Observation, the researchers taught an “inverse dynamics model” (IDM) to infer what action was taken between a before state and an after state in various simulated environments. IDM-labeling works for computer use datasets because mouse movements and typing are often easily inferable from the screen: if a “K” shows up, you can be reasonably confident the “K” key was pressed. [1] There are harder examples (e.g. a Cmd+V from an earlier Cmd+C), but looking at minutes of history lets us accurately label long-range inverse dynamics, so we can have high confidence in the sequence of actions that produced a given computer state for almost any video. OpenAI’s Video PreTraining (VPT) paper was the first to apply this method at scale, bootstrapping a Minecraft-specific IDM on a small amount of contractor data to create a competent Minecraft agent with six seconds of context. [2] https://arxiv.org/pdf/2510.19 VideoAgentTrek also trained a computer action IDM to label data. The key limitation there is the lack of video context (so it cannot handle Blender or any other continuous task); it relies instead on screenshot-action-CoT triplets.
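The keystroke example above can be made concrete. Here is a deliberately simplified sketch of the inference an IDM performs; the real model is learned from data, and `infer_typed_keys` with its string-diff heuristic is an illustrative assumption, not the actual system.

```python
def infer_typed_keys(before: str, after: str) -> list[str]:
    """Toy inverse dynamics: given the visible text before and after a
    transition, infer which keys were pressed. If `after` simply extends
    `before`, the appended characters were very likely typed."""
    if after.startswith(before) and len(after) > len(before):
        return list(after[len(before):])
    # Ambiguous edits (paste, deletion, cursor moves) need more history;
    # this is where a learned IDM with minutes of context earns its keep.
    return []

infer_typed_keys("hello", "hello!")  # -> ["!"]
```

Cases like Cmd+V are exactly the ones this heuristic punts on, which is why long-range context matters for the learned IDM.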
VPT’s architecture was able to learn complex behaviors, something still inaccessible to VLM-based approaches. Unlike Minecraft, however, complex design, finance, and general computer use require not just six seconds but minutes to hours of context.
The missing piece is a video encoder. VLMs burn a million tokens to understand just one minute of 30 FPS computer data. Our video encoder encodes nearly 2 hours of video in the same number of tokens—that’s 50x more token-efficient than the previous state-of-the-art and 100x more token-efficient than OpenAI’s encoder. These improvements in context length and dataset size mean we can finally pretrain on enough video to scale computer action models.
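The rough arithmetic behind these efficiency claims, using only the numbers stated in this paragraph:

```python
FPS = 30

# A VLM spends ~1M tokens to understand one minute of 30 FPS video.
vlm_tokens_per_frame = 1_000_000 / (1 * 60 * FPS)    # ~556 tokens per frame

# Our encoder fits almost 2 hours of 30 FPS video into the same budget.
fdm_tokens_per_frame = 1_000_000 / (2 * 3600 * FPS)  # ~4.6 tokens per frame

ratio = vlm_tokens_per_frame / fdm_tokens_per_frame  # 120x fewer tokens per frame
```

The exact 120x here overshoots the headline figures slightly because it takes “almost 2 hours” as a full 2 hours.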
Training Recipe
Our training recipe consists of three stages (see Figure ?). First, we train an IDM on 40,000 hours of contractor-labeled screen recordings. Second, we use the IDM to label our 11-million-hour video corpus. Finally, we use the IDM-labeled videos to autoregressively train a “forward dynamics model” (FDM) on next action prediction. The FDM’s output token space consists of key presses and mouse movement deltas, expressive enough to model any action taken on a computer.
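To make the FDM’s output token space concrete, here is one way such an action vocabulary could be laid out. The key list, the delta bin range, and `encode_action` are our illustrative assumptions; the post states only that the space covers key presses and mouse movement deltas.

```python
# Hypothetical discrete action vocabulary: one token per key event,
# plus quantized (dx, dy) mouse-movement deltas.
KEYS = [f"key_{c}" for c in "abcdefghijklmnopqrstuvwxyz0123456789"] + [
    "key_enter", "key_space", "key_shift", "key_cmd",
    "click_left", "click_right",
]
DELTAS = range(-16, 17)  # mouse deltas quantized to +/-16 px per step
MOUSE = [f"move_{dx}_{dy}" for dx in DELTAS for dy in DELTAS]
VOCAB = {tok: i for i, tok in enumerate(KEYS + MOUSE)}

def encode_action(token: str) -> int:
    """Map an action token to its id in the model's output space."""
    return VOCAB[token]
```

Because any keyboard or mouse interaction decomposes into a sequence of such tokens, next-action prediction over this space can in principle model any behavior on a computer.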