Stanford's DeLM cuts multi-agent task costs 50% — without a central orchestrator

Why This Matters

Stanford's DeLM takes a decentralized approach to multi-agent AI systems, removing the central orchestrator. Agents coordinate directly through a shared knowledge base, which the researchers say reduces cost and coordination overhead as the number of subtasks grows. It challenges the assumption, common in today's agent frameworks, that multi-agent reasoning needs a controller at the center.

Key Takeaways

One of the assumptions behind today’s AI frameworks is that agents require a “boss” at the center: an orchestrator that runs the show, routes requests, and makes sure the whole system doesn’t descend into chaos. That assumption may be wrong, and the cost of carrying it could be measured in inference dollars and coordination latency.

A new Stanford framework called a decentralized language model, or DeLM, is built on the premise that agents can coordinate directly, without routing every update through a central controller.

DeLM's shared knowledge base serves as a “common communication substrate” so that agents can build upon one another’s verified progress without having to route every interaction through a main agent to “merge, filter, and rebroadcast,” Yuzhen Mao and Azalia Mirhoseini, co-developers of the framework, explain in a research paper. Such a system is not only possible but, in certain instances, desirable: “Agents can build on prior findings, avoid repeated failures, preserve constraints, and recover detailed evidence only when needed.”

The challenges of traditional multi-agent systems

In a typical centralized multi-agent system, a main agent breaks a task into subtasks, assigns them to multiple sub-agents in parallel, waits for responses, merges and summarizes intermediate progress, then launches the next wave of assignments based on the collected context. While this is a natural way to scale LLM reasoning, the Stanford researchers argue that it scales poorly. Every useful finding, partial finding, and failure must be reported back to the main agent, which then determines what information to merge and rebroadcast to the agents below it.

“As the number of subtasks grows, this controller becomes a communication and integration bottleneck,” Mao and Mirhoseini write. Further, the main orchestrator may “dilute, omit, or distort” useful information, leading to lost progress.

This bottleneck also occurs in long-context reasoning scenarios.
Once it receives reports back from sub-agents, a main agent will typically group related concepts, data points, and other materials together in an unsupervised learning loop. It may then pre-assign these “evidence clusters” to sub-agents before knowing which surfaced material is actually relevant or whether it has been combined correctly. A sub-agent that receives insufficient context cannot finish its subtask and has to go back to the main agent, kicking off another retrieval or delegation round.

“This back-and-forth makes coordination slower, more iterative, and increasingly constrained by a single overloaded main agent,” the researchers write.

What DeLM addresses and how it works

DeLM, by contrast, is built around parallel agents, a shared context, and a task queue. The shared context is essentially a curated store of “gists,” or information summaries that other agents might find useful. These include verified, evidence-based findings alongside partial findings and documented failures; they also point to detailed evidence that agents can pull from based on their specific task. The task queue is the set of pending subtasks that agents can claim independently.

“Agents write compact, verified updates into a shared context that later agents can read directly,” the researchers write. Useful findings, failures, and constraints accumulate as a “shared problem state,” rather than passing through a central controller.

The pipeline looks like this:

1. Initialization: Inputs are broken into work units and added to a queue.
2. Parallel execution: Agents work independently and concurrently, pulling tasks and reading shared context as they progress.
3. Compression and verification: Results are compressed into reusable “gists” that are checked against supporting evidence. Only gists that are fully verified are shared with the group.
4. Additional work (if needed): When the queue is emptied, the last agent to return an answer inspects all the shared context to determine whether further work is required.
5. Final step: Once that agent determines that no more steps are required, it returns the final answer.

Agents “exchange progress through shared state, asynchronously claim ready tasks, and scale more adaptively as the number of subtasks grows,” the researchers explain.

How DeLM performs in the wild

With DeLM, agents can avoid redundant exploration; reuse and build on each other’s discoveries and failures; and focus on unresolved issues.

The framework can be particularly useful for test-time scaling in software engineering, where models are given more time to “think” at inference to improve their reasoning and problem-solving. Different agents can explore their own hypotheses or pursue reasoning paths in parallel, while still sharing intermediate progress. One example is concurrent debugging.

DeLM is also suitable for long-context reasoning and multi-document question-answering; agents can examine their own evidence clusters (collections of papers, code, or other materials) simultaneously, while maintaining a “global compact view” of accumulated evidence.

The researchers contend that it makes agentic tasks more accurate and significantly cheaper, and they back this with results on real-world benchmarks. On SWE-bench Verified, which evaluates how well AI models and agents solve real-world software engineering problems, DeLM performed 10.5% better than the strongest baseline and reduced cost per task by roughly 50%. It can also go beyond coding: on LongBench-v2 Multi-Doc QA, which assesses LLMs’ ability to handle long-context, real-world problems, DeLM had the highest accuracy across four model families, including GPT-5.4, Claude Sonnet, Gemini Flash, and DeepSeek-V4-Pro.

DeLM outperforms the baselines on SWE-bench for a number of reasons, as Mao detailed on X. First, agents share failures.
In ordinary parallel runs, when one agent follows the wrong path, that failure stays private, and subsequent agents may waste time (and money) pursuing the same dead end. With DeLM, failed hypotheses are written into shared context. “Later agents can read them as constraints, avoid repeated exploration, and redirect their search toward more promising fixes,” Mao said.

Second, constraints, once verified, are immediately added to agents’ shared context, where they become binding shared state. “Later agents inherit them, build around them, and avoid repeating globally invalid simplifications,” Mao said.

Crucially, DeLM keeps shared progress compact enough to reuse. The shared context is unfoldable, meaning agents see short gists by default but can choose to unfold them into more detailed summaries and raw evidence.

As the researchers note, providing all raw documents and traces gives agents the maximum amount of information, but that can overwhelm their context windows and ultimately increase costs. “If agents shared full traces, each worker would need to read long command histories, file dumps, failed edits, and intermediate reasoning, turning coordination itself into another long-context bottleneck,” Mao said. On the other hand, while sharing compact summaries is cheaper, important details and evidence can be lost, resulting in less reliable reasoning. Unfolding, therefore, provides “coarse-to-fine” opt-in access, which can improve accuracy while keeping costs down.

Ultimately, with a framework like DeLM, agents can be more efficient because they avoid repeatedly reading the same documents or rerunning the same failed analysis; more effective because useful findings are propagated across parallel threads; and more robust because they share only verified claims.

For enterprise builders, DeLM challenges a core assumption: that every multi-agent workflow needs a central controller. The SWE-bench and LongBench-v2 results suggest the decentralized model isn't just theoretically cleaner; in those tests it was also more accurate, at roughly half the cost per task on SWE-bench Verified.
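
To make the pipeline described above more concrete, here is a minimal Python sketch of the general pattern: a task queue that agents claim from independently, and a shared context that accepts only verified gists and lets agents unfold detail on demand. It is an illustration under stated assumptions, not the researchers' implementation. The names `Gist`, `SharedContext`, `toy_solve`, `worker`, and `run` are hypothetical, `toy_solve` stands in for an LLM call, the "verification" is a trivial placeholder (a gist must carry evidence), and the follow-up step in which the last agent decides whether more work is needed is left out.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Gist:
    """Compact, shareable summary of one work unit (hypothetical structure)."""
    task: str
    summary: str   # short text every agent sees by default
    evidence: str  # fuller detail an agent can unfold on demand
    kind: str      # "finding" or "failure"


class SharedContext:
    """Thread-safe store of verified gists that all agents read directly."""

    def __init__(self):
        self._lock = threading.Lock()
        self._gists = []

    def publish(self, gist):
        """Store the gist only if it passes verification; return True if stored."""
        if not gist.evidence:  # stand-in check (assumption): claims need evidence
            return False
        with self._lock:
            self._gists.append(gist)
        return True

    def summaries(self):
        """Coarse view: short summaries only."""
        with self._lock:
            return [g.summary for g in self._gists]

    def unfold(self, task):
        """Fine view: the stored evidence for one task."""
        with self._lock:
            return [g.evidence for g in self._gists if g.task == task]


def toy_solve(task, known_summaries):
    """Placeholder for an LLM call; returns (summary, evidence, kind).

    A real agent would condition on known_summaries to skip known dead ends.
    """
    if task.startswith("bad"):
        return f"{task}: approach failed", f"trace for {task}", "failure"
    if task.startswith("unverified"):
        return f"{task}: guess", "", "finding"  # no evidence, so it is rejected
    return f"{task}: done", f"diff for {task}", "finding"


def worker(tasks, context, solve):
    """Claim ready tasks until the queue is empty; no central controller involved."""
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        summary, evidence, kind = solve(task, context.summaries())
        context.publish(Gist(task, summary, evidence, kind))


def run(work_units, solve, n_agents=3):
    tasks = queue.Queue()
    for unit in work_units:  # initialization: fill the task queue
        tasks.put(unit)
    context = SharedContext()
    agents = [threading.Thread(target=worker, args=(tasks, context, solve))
              for _ in range(n_agents)]
    for agent in agents:
        agent.start()
    for agent in agents:
        agent.join()
    return context


if __name__ == "__main__":
    ctx = run(["fix-parser", "bad-cache-patch", "unverified-hunch"], toy_solve)
    print(sorted(ctx.summaries()))        # compact shared state
    print(ctx.unfold("bad-cache-patch"))  # detail pulled only when needed
```

In this toy run the failed attempt is stored alongside the successful one, so a later agent could read it as a constraint, while the evidence-free guess never reaches the shared context. Agents read only the short summaries by default and call `unfold` when they need the underlying trace, which mirrors the coarse-to-fine access the article describes.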