An FAQ on Reinforcement Learning Environments

This post is a collaboration between guest author Chris Barber and JS Denain from Epoch AI.

Reinforcement learning (RL) environments have become central to how frontier AI labs train their models. In September 2025, The Information reported that Anthropic had discussed spending over $1 billion on RL environments over the following year. As Andrej Karpathy put it in his 2025 year-in-review: by training LLMs on a wide range of verifiable tasks across different environments, “the LLMs spontaneously develop strategies that look like ‘reasoning’ to humans.”

This wave of RL for capabilities started with OpenAI’s o1, which was trained on math and coding problems with verifiable answers. Since then, labs have expanded the range of tasks they train on, all the while scaling the amount of compute spent on RL training.

Without diverse, high-quality environments and tasks to train on, throwing more compute at RL risks wasting much of it. As a result, creating those tasks and environments has become a key bottleneck for scaling capabilities, and a growing market that remains largely behind closed doors.

To understand the emerging industry of building environments and tasks that labs use to RL-train their models, we interviewed 18 people across RL environment startups, neolabs, and frontier labs. We asked them what RL environments and tasks look like, how labs use them, what makes a good one, and where the field is headed.

Main takeaways:

Enterprise workflows are a major growth area. Math and coding tasks came first, but we’re now seeing significant growth in enterprise workflows: tasks like navigating Salesforce, filing reports, or manipulating spreadsheets.

Reward hacking is a top concern. Interviewees consistently cited robustness against reward hacking as a key quality criterion. Models find ways to game graders, and preventing this requires extensive iteration on both environments and tasks.

Scaling without sacrificing quality is hard. A major challenge is scaling the quantity of environments and tasks without sacrificing quality. The hard parts are management (coordinating a growing number of task builders) and maintaining good quality assessment processes.

