
Meta’s DreamGym framework trains AI agents in a simulated world to cut reinforcement learning costs


Researchers at Meta, the University of Chicago, and UC Berkeley have developed a new framework that addresses the high costs, infrastructure complexity, and unreliable feedback associated with using reinforcement learning (RL) to train large language model (LLM) agents. The framework, DreamGym, simulates an RL environment to train agents for complex applications. As training progresses, the framework dynamically adjusts task difficulty, ensuring the agent gradually learns to solve more challenging problems as it improves.

Experiments by the research team show that DreamGym substantially improves RL training in both fully synthetic settings and scenarios where the model must apply its simulated learning to the real world. In settings where RL is possible but expensive, it matches the performance of popular algorithms using only synthetic interactions, significantly cutting the costs of data gathering and environment interaction. This approach could be valuable for enterprises, allowing them to train agents for bespoke applications while avoiding the complexities of setting up and running live RL environments.

The challenge of training LLM agents

Reinforcement learning is a key technique for training LLMs to handle complex tasks in agentic environments, such as web navigation, tool use, and robotics. It allows models to learn from direct interaction and experience, moving beyond the static datasets used in pre-training.

However, RL for agent training remains difficult. Real-world applications often involve long action sequences with sparse rewards, meaning the agent only receives a positive signal after a long and correct sequence of actions. Gathering enough diverse and validated data is also expensive, frequently requiring human experts to verify tasks and annotate outcomes. The infrastructure needed to run live environments for large-scale RL training can be prohibitively complex and costly. And interacting with live systems carries risks, as wrong actions (like deleting a file) can cause irreparable damage. "These limitations make building general-purpose and scalable systems for training agents with RL an open and pressing challenge," the researchers write.

DreamGym challenges that paradigm by delivering comparable performance entirely in simulation, removing the infrastructure burden that has kept most enterprises from adopting RL and giving teams a practical path to train agents without touching costly or risky live environments.

How DreamGym works

The researchers describe DreamGym as a "unified and scalable RL framework that synthesizes diverse experience data in an online manner to enable efficient and effective training of LLM agents." It is built around three core components that work together to create a controlled and effective training loop.

The first component is a "reasoning-based experience model" that translates the dynamics of a target environment into a textual space. This model acts as the simulator of the application environment. Instead of interacting with a costly real environment, the agent interacts with this model, which generates consistent state transitions and feedback based on the agent's actions. The researchers argue that agent training doesn't need perfectly realistic environments, but rather data that is "sufficiently diverse, informative, and causally grounded." For example, in a web shopping task, the model synthesizes clean listings of on-page elements rather than processing raw HTML. This abstract approach makes training the experience model highly efficient, requiring only a small amount of public data.
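The article does not reproduce the paper's implementation, but the general idea of an LLM-backed experience model can be illustrated. Below is a minimal Python sketch; the `llm_complete` helper, the prompt, the state format, and the reward convention are assumptions for illustration, not DreamGym's actual API.

```python
# Illustrative sketch of a reasoning-based experience model: the "environment"
# is an LLM that, given an abstract text state and an action, reasons about
# the outcome and returns the next state and a reward.
# `llm_complete` is a hypothetical placeholder, not part of the DreamGym release.
import json

def llm_complete(prompt: str) -> str:
    """Swap in a call to your preferred LLM API here."""
    raise NotImplementedError("plug in an LLM client")

TRANSITION_PROMPT = """You are simulating a web-shopping site.
Current state (abstract list of on-page elements):
{state}

Agent action: {action}

Reason about the consequence of this action, then answer with JSON only:
{{"next_state": [...], "reward": 0 or 1, "done": true or false}}"""

def synthetic_step(state: list[str], action: str) -> tuple[list[str], float, bool]:
    """One synthetic transition in textual space: (state, action) -> (state', reward, done)."""
    raw = llm_complete(TRANSITION_PROMPT.format(state=json.dumps(state), action=action))
    out = json.loads(raw)
    return out["next_state"], float(out["reward"]), bool(out["done"])

def rollout(policy, initial_state: list[str], max_steps: int = 15):
    """Collect a full synthetic trajectory; the agent never touches raw HTML or a live site."""
    state, trajectory = initial_state, []
    for _ in range(max_steps):
        action = policy(state)                              # agent proposes an action
        next_state, reward, done = synthetic_step(state, action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory
```

In a formulation like this, the cost of a rollout is a handful of LLM calls rather than a live browser session, which is where the claimed savings in data gathering and environment interaction would come from.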
The second component is an "experience replay buffer," which acts as a dynamic memory. At the beginning of the training process, the buffer is seeded with offline data to provide essential context, and it is continuously updated with new synthetic trajectories generated during training. This buffer helps guide the experience model's predictions, ensuring the synthetic experiences remain diverse and factually grounded.

The third component, a "curriculum task generator," works in tandem with the experience model to adaptively create new tasks that are progressively more challenging. It identifies tasks where the agent's performance is mixed (signaling they are difficult but solvable) and generates variations to push the agent's capabilities.
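The article does not give the generator's actual selection rule, but the "mixed performance" heuristic it describes can be sketched in a few lines of Python. The thresholds, the `TaskStats` record, and the `llm_complete` argument below are illustrative assumptions, not DreamGym's code.

```python
# Hypothetical sketch of an adaptive curriculum step: pick "frontier" tasks the
# agent sometimes solves and sometimes fails, then ask an LLM for harder variants.
from dataclasses import dataclass

@dataclass
class TaskStats:
    description: str
    attempts: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0

def select_frontier_tasks(tasks: list[TaskStats], low: float = 0.2,
                          high: float = 0.8, min_attempts: int = 5) -> list[TaskStats]:
    """Keep tasks with mixed outcomes: hard enough to teach, easy enough to learn."""
    return [t for t in tasks
            if t.attempts >= min_attempts and low <= t.success_rate <= high]

def propose_variations(task: TaskStats, llm_complete, n: int = 3) -> list[str]:
    """Generate progressively harder variants of a frontier task via an LLM call."""
    prompt = (f"Rewrite the following agent task {n} times, each version slightly "
              f"harder (extra constraints, longer horizon), one per line:\n"
              f"{task.description}")
    return llm_complete(prompt).strip().splitlines()[:n]
```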
Together, these components create a closed-loop system for scalable agent training. "By unifying interaction, memory, and adaptive online task generation, DreamGym addresses the persistent challenges that have limited RL for LLM agents training: prohibitive cost, scarcity of diverse tasks, unstable reward signals, and heavy infrastructure demands," according to the researchers.

DreamGym in action

The researchers evaluated DreamGym across several agent benchmarks, including WebShop (e-commerce), ALFWorld (embodied control), and WebArena (realistic web interaction). They used Llama 3 and Qwen 2.5 models as agent backbones and compared DreamGym against several traditional training strategies. These included offline methods like supervised fine-tuning (SFT) and direct preference optimization (DPO), as well as online RL algorithms like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which improve agents through live environment interaction.

DreamGym showed its most significant advantage in environments like WebArena, where setting up large-scale RL infrastructure is difficult. Agents trained entirely inside DreamGym achieved success rates over 30% higher than baseline methods, which struggled with the sparse rewards and limited exploration of the real environment. The researchers say this shows DreamGym makes RL training "feasible in domains that were previously intractable due to inherent task and engineering constraints."

In environments where RL is supported but costly, agents trained with DreamGym performed on par with those trained using GRPO and PPO, but without any costly interactions with the external environment. The team also introduced a sim-to-real approach, DreamGym-S2R, in which an agent is first trained in the synthetic environment and then fine-tuned on a small amount of real-world data. This strategy yielded over a 40% performance improvement compared to training from scratch in the real environment, while using less than 10% of the external data. It provides a scalable "warm start" for training general-purpose agents.

Finally, the framework demonstrated strong generalization. An agent trained on tasks in one domain, such as WebShop, could successfully transfer its learned skills to another, like WebArena. The researchers suggest this is because DreamGym agents learn in an "abstract meta-representation space, enabling the agent to learn domain-agnostic behavioral priors rather than memorizing task-specific patterns."

While still in its early stages, DreamGym shows that simulated environments can deliver substantial gains in agent training. In practice, an enterprise could gather a small set of trajectories and task descriptions for the workflows it wants to automate, then use that seed data to bootstrap the DreamGym framework for scalable, sample-efficient agent training.
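To make that warm-start recipe concrete, here is a minimal sketch of the two-phase, DreamGym-S2R-style loop described above. The `policy`, `experience_model`, and `real_env` objects, their `rollout` and `update` methods, and the step budgets are hypothetical stand-ins for whatever agent stack and environment a team already runs; none of it comes from the paper's code.

```python
# Hypothetical two-phase "warm start": train on cheap synthetic rollouts first,
# then fine-tune on a small budget of real interactions. All objects below are
# placeholders for your own policy model, DreamGym-style simulator, and live environment.
def warm_start_training(policy, experience_model, real_env,
                        synthetic_rollouts: int = 50_000,
                        real_rollouts: int = 5_000):
    # Phase 1: risk-free RL against the text-based experience model.
    for _ in range(synthetic_rollouts):
        trajectory = experience_model.rollout(policy)   # synthetic interaction only
        policy.update(trajectory)                       # any policy-gradient update

    # Phase 2: brief fine-tuning against the real environment, using a much
    # smaller interaction budget than training from scratch would require.
    for _ in range(real_rollouts):
        trajectory = real_env.rollout(policy)           # costly live interaction
        policy.update(trajectory)

    return policy
```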