MIT’s new ‘recursive’ framework lets LLMs process 10 million tokens without context rot

Recursive language models (RLMs) are an inference technique developed by researchers at MIT CSAIL that treat long prompts as an external environment to the model. Instead of forcing the entire prompt into the model's context window, the framework allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the text.

Rather than expanding context windows or summarizing old information, the MIT team reframes long-context reasoning as a systems problem. By letting models treat prompts as something they can inspect with code, recursive language models allow LLMs to reason over millions of tokens without retraining. This offers enterprises a practical path to long-horizon tasks like codebase analysis, legal review, and multi-step reasoning that routinely break today’s models.

Because the framework is designed as a wrapper around existing models, it can serve as a drop-in replacement for applications that make direct calls to LLMs.

The LLM context problem

While frontier models are becoming increasingly sophisticated at reasoning, their ability to process massive amounts of information is not scaling at the same rate. This bottleneck is driven by two distinct limitations: the hard physical constraint on how much text a model can process at once (context length) and "context rot."

The challenge, the researchers argue, is whether it’s possible to scale the effective context size of general-purpose LLMs by orders of magnitude without retraining them. This capability is becoming increasingly important for enterprise applications, where LLMs are adopted for long-horizon tasks requiring the processing of millions of tokens, a challenge Zhang argues can’t be solved by simply expanding context windows.

"There is an entropy argument that implies you need exponentially more data samples as you increase the effective context window size," Alex Zhang, a co-author of the paper, told VentureBeat.

Current approaches to extending context often rely on compaction, where the model summarizes older parts of the conversation to free up space. However, this method fails for tasks requiring random access to specific details located in earlier parts of the prompt.

How RLMs work

The concept behind RLMs is drawn from "out-of-core" algorithms used in classical computing. These algorithms are designed to process datasets too large to fit into a computer's main memory by keeping the data on a hard drive and fetching only the necessary chunks as needed.

RLMs apply this logic to generative AI. Instead of feeding a long prompt directly into the neural network, the framework loads the text as a string variable inside a Python coding environment. The LLM is given general context about the data (such as the total character count) but does not "see" the text initially.

Once the prompt is stored as a variable, the LLM acts as a programmer. It writes Python code to interact with the external variable, using standard commands to peek into the data. For example, the model might use regular expressions to search for specific keywords like "Chapter 1" or "financial results." When the code execution finds a relevant snippet, the RLM pulls only that specific chunk into its active context window for analysis.

For example, if the prompt is a massive book, the LLM might write a loop that identifies chapter boundaries and then triggers a sub-call to summarize each chapter individually.

The architecture typically involves two agents. A "root language model," often a capability-heavy model like GPT-5, acts as the orchestrator. It plans the approach, writes the code, and manages the data flow within the REPL environment. A "recursive language model," often a faster and cheaper model, acts as the worker.
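
The workflow described above can be sketched in a few lines of Python. This is a hand-written illustration under stated assumptions, not the MIT team's code: in an actual RLM the root model would generate code of this kind itself inside the REPL. The names `peek`, `split_chapters`, `summarize_book`, and `stub_worker` are hypothetical, and `stub_worker` merely stands in for a call to a cheaper worker model.

```python
import re
from typing import Callable, List


def peek(text: str, pattern: str, width: int = 40) -> List[str]:
    """Return a short window of text around every regex match."""
    return [
        text[max(0, m.start() - width): m.end() + width]
        for m in re.finditer(pattern, text)
    ]


def split_chapters(text: str) -> List[str]:
    """Split on lines that start with 'Chapter <number>'.

    Assumption for this sketch: any text before the first heading is dropped.
    """
    starts = [m.start() for m in re.finditer(r"^Chapter \d+", text, flags=re.MULTILINE)]
    if not starts:
        return [text]
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def summarize_book(prompt: str, sub_lm: Callable[[str], str]) -> str:
    """Issue one sub-call per chapter so only that chunk reaches the worker."""
    chapters = split_chapters(prompt)
    summaries = [sub_lm(f"Summarize:\n{chapter}") for chapter in chapters]
    return "\n".join(summaries)


def stub_worker(request: str) -> str:
    """Hypothetical stand-in for a worker model: echoes the chunk's heading."""
    return request.splitlines()[1]


# The long prompt lives in an ordinary variable, outside any context window.
book = "Chapter 1\nThe firm was founded.\nChapter 2\nFinancial results improved.\n"

print(len(book))                        # metadata the root model would be given
print(peek(book, r"Financial results", width=10))
print(summarize_book(book, stub_worker))
```

Replacing `stub_worker` with a function that calls a real model would turn the same loop into a naive chapter summarizer; the point of the sketch is only that the full text is touched by code, while each model call receives a single chunk.
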
The root LM calls this worker to process the specific text snippets isolated by the code.

Because the prompt resides in the environment's memory rather than the model's context window, the system can handle inputs far larger than the model's training limit. Importantly, to the end-user, the RLM behaves exactly like a standard model: It accepts a string and returns an answer. This allows enterprise teams to swap standard API calls for RLMs.

For developers looking to experiment, the RLM code is currently available on GitHub.

"A key argument for RLMs is that most complex tasks can be decomposed into smaller, 'local' sub-tasks," Zhang said. "However, how to perform this context/problem decomposition is non-trivial, and the model must be capable of performing this."

RLMs in action

To validate the framework, the researchers tested RLMs against base models and other agentic approaches like CodeAct and summary agents across a variety of long-context tasks, including retrieval and multi-hop question answering.

The results demonstrated strong performance gains at the 10 million+ token scale. On BrowseComp-Plus, a benchmark involving inputs of 6 to 11 million tokens, standard base models failed completely, scoring 0%. In contrast, the RLM powered by GPT-5 achieved a score of 91.33%, significantly outperforming the Summary Agent (70.47%) and CodeAct (51%).

The framework also excelled at tasks with high computational complexity. On OOLONG-Pairs, an information-dense reasoning benchmark where the difficulty scales quadratically with input length, base GPT-5 models failed catastrophically with a score of just 0.04%. The RLM achieved an F1 score (a balanced measure of precision and recall) of 58%, demonstrating emergent capabilities to handle dense tasks that paralyze standard models.
Similarly, on code understanding tasks (CodeQA benchmark), the RLM more than doubled the performance of the base GPT-5 model, jumping from 24% to 62%.

Regarding the context rot problem, the data showed that while the base GPT-5 performance degrades rapidly as task complexity increases, RLM performance holds steady, consistently outperforming the base model on contexts longer than 16,000 tokens.

Despite the increased complexity of the workflow, RLMs often maintained comparable or lower average costs than the baselines. On the BrowseComp-Plus benchmark, the RLM was up to three times cheaper than the summarization baseline. However, the researchers noted that while median costs are low, RLM trajectories are "long-tailed." Outlier runs can become expensive if the model gets stuck in loops or performs redundant verifications. While GPT-5 was conservative in its sub-calls, the open-source Qwen3-Coder model sometimes attempted thousands of sub-calls for simple tasks.

"Today, you likely will have to implement your own guardrails and logic to control RLM behavior," Zhang said. However, he hypothesizes that future models could be trained to manage their own compute budgets more effectively. Companies like Prime Intellect are planning to integrate RLM into the training process of models, possibly addressing the edge cases where the model’s inference budget spikes.

For enterprise architects deciding where to place their bets, the RLM framework offers a new tool for handling information-dense problems.

"I think RLMs are still extremely useful for chatbots (think long chat histories), but ultimately they argue for an alternative way of using LMs," Zhang said. "I think RLMs work in tandem with standard retrieval methods like RAG; they do not serve as a replacement, and can be used in different settings or together."
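
For teams acting on Zhang's advice about guardrails, one simple option is to cap the number of sub-calls a single run may make and to reuse answers for repeated requests. The sketch below is one possible approach, not something the researchers describe: `BudgetedWorker` and `BudgetExceeded` are hypothetical names, and the lambda stands in for a real worker model.

```python
from typing import Callable, Dict


class BudgetExceeded(RuntimeError):
    """Raised when a run tries to exceed its sub-call allowance."""


class BudgetedWorker:
    """Wrap a worker callable with a hard sub-call cap and a response cache."""

    def __init__(self, worker: Callable[[str], str], max_calls: int) -> None:
        self._worker = worker
        self._max_calls = max_calls
        self._cache: Dict[str, str] = {}
        self.calls = 0  # number of requests actually sent to the worker

    def __call__(self, request: str) -> str:
        # Repeated requests are served from the cache and cost nothing.
        if request in self._cache:
            return self._cache[request]
        if self.calls >= self._max_calls:
            raise BudgetExceeded(f"sub-call budget of {self._max_calls} exhausted")
        self.calls += 1
        answer = self._worker(request)
        self._cache[request] = answer
        return answer


# Hypothetical worker: upper-cases the request instead of calling a model.
worker = BudgetedWorker(lambda request: request.upper(), max_calls=2)
worker("check clause 4")
worker("check clause 4")   # cache hit, not counted
worker("check clause 9")
print(worker.calls)

try:
    worker("check clause 12")
except BudgetExceeded as err:
    print(f"stopped: {err}")
```

A hard cap addresses the long-tail outliers bluntly; a production system would likely also want time or token limits, which this sketch leaves out.
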