Benchmark: Lab-Play

Early signs of life for production automation

Lab-play is a highly constrained environment in which agents are given a fixed set of resources and a single target entity whose production throughput they must maximize. This simple setting has only a tiny fraction of the complexity of open-play, where agents spawn in a procedurally generated map and must achieve a complex goal with no starting inventory and sparser resources. Agents write Python using the FLE API to interact with the game, and observe the standard output and error messages from their execution.

We replicate the methodology from the original FLE paper for the lab-play setting to evaluate the strongest models as of September 2025. The standardized agent harness is minimal: it continuously appends environment interactions to a single conversational history, and when the token budget nears exhaustion, it asks the agent to summarize the older history so it can continue reasoning while remaining aware of past interactions. We do not evaluate agents with backtracking and/or reflection logic as we did in FLE 0.2.0; instead, we encourage the community to experiment with more advanced agent designs.

Setting

- Objective: achieve production throughput targets of 16 per minute for solid items and 250 per minute for fluids.
- Prompt: documentation of the FLE API, Factorio recipes, and a guide describing common patterns.
- Inventory: a set of useful items for building functional factories.
- Max Steps: 64 steps, with early stopping upon completion.
- Reasoning: default settings ({"enabled": true}) for models that support reasoning.

Model Performance on Lab-Play Tasks (Pass@8 - Throughput Level 1)

Open source models have caught up to the SoTA performance observed in v0.2.0 (May 2025), with successes in electronic circuits, steel plate, sulfur and plastic automation. This is consistent with the broader trend that the time for open source models to reach parity with closed source results is shrinking.

Discussion

The latest generation of frontier models continues to advance the state of the art in FLE, with substantial improvement over FLE v0.2.0. For the first time, models achieve successes in the harder half of tasks, which can involve over a dozen ingredient dependencies. FLE lab-play clearly differentiates the capabilities of frontier models. Notably, the ranking and the performance gaps among the most advanced models (Claude > GPT > Gemini > Grok) are most similar to those of GDPVal, a novel benchmark recently released to measure progress in automating economically valuable tasks. This contrasts with many other static, exam-like benchmarks, including Humanity's Last Exam, AIME 25, GPQA and MMMU, on which models that are weaker in FLE achieve higher scores.

While successful agents achieve their throughput goals, many rely on semi-manual strategies rather than building robust automation for more complex tasks. This manifests as agents shuttling resources manually and using storage chests as resource buffers instead of constructing fully automated logistics chains. Although this does make progress toward the target, it creates a local optimum where agents shortcut the more difficult but necessary step of full automation.
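For concreteness, an agent-submitted action program in this setting might look like the short sketch below, which also illustrates the kind of helper abstraction discussed later in this section. The tool and prototype names follow the FLE API that agents see in their documentation, but the exact signatures, attributes (such as drop_position) and quantities here are assumptions for illustration rather than verified calls.

```python
# Illustrative lab-play action program. The tools (nearest, move_to, place_entity,
# insert_item, sleep, inspect_inventory) are provided in the agent's FLE namespace;
# signatures and quantities below are assumptions, not verified against the API.

def build_smelting_cell(ore_position):
    """Helper abstraction: a burner drill dropping ore straight into a stone furnace."""
    move_to(ore_position)
    drill = place_entity(Prototype.BurnerMiningDrill, position=ore_position)
    furnace = place_entity(Prototype.StoneFurnace, position=drill.drop_position)
    insert_item(Prototype.Coal, drill, quantity=10)    # fuel both machines
    insert_item(Prototype.Coal, furnace, quantity=10)
    return furnace

furnace = build_smelting_cell(nearest(Resource.IronOre))
sleep(30)                             # let the factory run for a while
print(inspect_inventory(furnace))     # observations come back as stdout in FLE
```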
Buffering also makes throughput challenging to measure consistently: agents can store items in intermediate buffers (such as chests or belts) that temporarily satisfy throughput checks without true sustained production. We mitigate this by enforcing a holdout period during evaluation: the agent must leave its factory untouched for 60 seconds before we test whether quotas are met (a sketch of such a check appears at the end of this section). Higher throughput targets would make it infeasible to pass with manual logistics, forcing agents to build proper automation.

Although the FLE harness provides a Python namespace for defining helper functions and abstractions, agents rarely leverage this capability. Instead, they rely on the primitive, out-of-the-box tools, which limits their ability to scale solutions to more complex tasks. We expect stronger coding models to routinely define their own abstractions in the future; currently, only Gemini 2.5 Pro takes this approach.

Agents also struggle to maintain consistent mental models of the factory layout. Misplacement of entities often cascades into larger failures, since the agent is unable to recover efficiently or reorganize the environment once errors occur (see below).

Frontier models display different capacities for error recovery

Above: Mean error rate across all trajectories for each model at each step number, revealing how error rates evolve throughout the 64-step trajectories.

Grok 4 often falls into degenerate debug loops, whereas GPT-5 recovers gracefully. The pattern suggests that error accumulation and recovery remain challenging across models, with most exhibiting elevated error rates in the middle portions of trajectories, when factory complexity increases.

Error Analysis

Common failure patterns can be grouped by type:

- Syntactic Errors: Invalid Python code, syntax mistakes, or other execution errors that prevent actions from running at all. These are significant failures, as the agent is not even following the high-level instructions expected of a coding agent.
- Semantic Errors: Misuse of FLE commands or tool arguments (e.g., incorrect parameters or misunderstood documentation), leading to errors during execution of the action program. These errors indicate difficulty in correctly using the API specification given in the system prompt, and most commonly surface as exceptions such as TypeError, AttributeError and NameError.
- Pragmatic Errors: Incorrect reasoning about the current game state, such as attempting to insert items that are not present in the inventory. These are the most common category of failures, as the game state is dynamic and challenging to model implicitly.
- Planning and Control Errors: Even when the primitives are known, agents fail to chain actions coherently, resulting in inefficient or incomplete trajectories. Note: this category cannot be reliably quantified through automated trajectory analysis, as it requires evaluating higher-level strategic coherence rather than individual error types.

Failure modes vary across frontier models

Above: Relative error frequencies across all evaluation tasks for proprietary frontier models. Mean error rates: Claude Opus 4.1: 22.99% | GPT-5: 25.05% | Gemini 2.5 Pro: 27.29% | Grok 4: 40.89%

The distribution of errors reveals notable model-specific patterns. Claude Opus 4.1 stands out with zero syntactic errors and almost entirely pragmatic errors (97.7%), indicating strong code generation but difficulties maintaining accurate mental models of game state. The other models (Gemini 2.5 Pro, Grok 4 and GPT-5) exhibit API misunderstandings at noticeable rates (12-17%), suggesting challenges with correctly using the FLE API documentation. Additionally, GPT-5 and Grok 4 show surprisingly high rates of syntactic errors (21% and 17% respectively), failing to generate valid Python code more frequently than we might expect from frontier models with SoTA coding-benchmark performance.
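The first three error categories above can be tallied mechanically from trajectory logs by keying on the exception type of each failed step. The sketch below shows one such classifier together with a per-step error-rate aggregation like the one plotted in the error-recovery figure; the record format (one dict per step with an optional exception name) is an assumption for illustration, not the actual FLE log schema.

```python
from collections import defaultdict

# Assumed record format (illustration only):
#   {"step": 12, "error": "AttributeError"}   or   {"step": 13, "error": None}
SYNTACTIC = {"SyntaxError", "IndentationError"}
SEMANTIC = {"TypeError", "AttributeError", "NameError"}

def classify(error_name):
    """Map a step's exception name onto the taxonomy used in this post."""
    if error_name is None:
        return "ok"
    if error_name in SYNTACTIC:
        return "syntactic"
    if error_name in SEMANTIC:
        return "semantic"
    return "pragmatic"  # treat remaining runtime failures as game-state errors in this sketch

def per_step_error_rate(trajectories, max_steps=64):
    """Mean error rate at each step index across one model's trajectories."""
    errors, totals = defaultdict(int), defaultdict(int)
    for trajectory in trajectories:
        for record in trajectory:
            step = record["step"]
            totals[step] += 1
            if classify(record.get("error")) != "ok":
                errors[step] += 1
    return [errors[s] / totals[s] if totals[s] else 0.0 for s in range(1, max_steps + 1)]
```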
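Finally, returning to the throughput measurement discussed earlier, the sketch below shows one way a buffer-resistant holdout check can be structured. The helpers freeze_agent_actions, cumulative_production and advance_game_time are hypothetical stand-ins for whatever the evaluation harness actually uses; only the overall shape (run the factory untouched for 60 seconds, then compare measured output against the quota) follows the description in this post.

```python
HOLDOUT_SECONDS = 60  # the factory must run untouched for this long

def passes_quota(target_item: str, target_per_minute: float) -> bool:
    """Sketch of a buffer-resistant throughput check (helper names are hypothetical)."""
    freeze_agent_actions()               # hypothetical: no agent programs run during the holdout
    before = cumulative_production()     # hypothetical: {item: total units ever produced}
    advance_game_time(HOLDOUT_SECONDS)   # hypothetical: let the simulation run untouched
    after = cumulative_production()

    # Count only what was actually produced during the holdout window, so items
    # pre-staged in chests or on belts cannot satisfy the quota by themselves.
    produced = after.get(target_item, 0) - before.get(target_item, 0)
    return produced * 60.0 / HOLDOUT_SECONDS >= target_per_minute
```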