Researchers at Nvidia and the University of Hong Kong have released Orchestrator, an 8-billion-parameter model that coordinates different tools and large language models (LLMs) to solve complex problems. In their experiments, Orchestrator achieved higher accuracy at a lower cost than much larger models on tool-use benchmarks, while also aligning with user preferences on which tools to use for a given query.

The model was trained through ToolOrchestra, a new reinforcement learning (RL) framework for training small models to act as intelligent coordinators. The approach is based on the idea that a small "orchestrator" managing a diverse team of specialized models and tools can be more effective and efficient than a single, monolithic AI system. The findings suggest that this composite approach could pave the way for more practical and scalable AI reasoning systems in the enterprise.

The limits of current LLM tool use

Giving LLMs access to external tools is a promising way to extend their capabilities beyond their training data and into agentic tasks. By calling on resources like search engines and code interpreters, AI agents can improve their accuracy and perform in-app tasks.

However, in the accompanying paper, the researchers argue that the current approach to building tool-using agents doesn't harness the full potential of this paradigm. Most systems equip a single, powerful model with a set of basic tools like web search or a calculator. The researchers argue that humans, when reasoning, "routinely extend themselves by calling upon resources of greater-than-human intelligence, from domain experts to sophisticated processes and software systems." Accordingly, LLMs should be able to interact with a wide range of tools in different capacities.

The tool orchestration paradigm

The paper proposes a shift from a single-model system to a composite one, managed by a lightweight "orchestrator" model.
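In outline, the idea can be sketched as a simple delegation loop. The toy below is an illustration only: the routing rule is hard-coded, whereas the paper's Orchestrator learns its routing policy through RL, and both tool stand-ins are invented for this sketch.

```python
# A minimal, self-contained sketch of the orchestration idea: a lightweight
# "orchestrator" inspects a task and delegates it to a specialized tool.
# Both tools here are trivial stand-ins, and the routing rule is a
# hand-written assumption, not the paper's learned policy.

def calculator(expr: str) -> str:
    # Stand-in for a code-interpreter / math tool.
    return str(eval(expr, {"__builtins__": {}}))

def search(query: str) -> str:
    # Stand-in for a web-search tool: a tiny canned knowledge base.
    kb = {"capital of france": "Paris"}
    return kb.get(query.lower(), "no result")

TOOLS = {"calculator": calculator, "search": search}

def orchestrate(task: str) -> str:
    # Toy routing policy: quantitative tasks go to the calculator,
    # everything else to search. The trained Orchestrator would make
    # this decision (and chain multiple tools) via learned reasoning.
    if any(ch.isdigit() for ch in task):
        return TOOLS["calculator"](task)
    return TOOLS["search"](task)

print(orchestrate("2 + 3 * 4"))          # delegates to the calculator
print(orchestrate("capital of France"))  # delegates to search
```

The real system generalizes this loop to multi-turn reasoning, where the orchestrator folds each tool's output back into its context before deciding the next step.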
The orchestrator's job is to analyze a complex task and break it down, invoking the right tools in the right order to arrive at a solution.

This toolset includes not only standard utilities like web search and code interpreters, but also other LLMs of various capabilities that function as "intelligent tools." For example, the orchestrator can delegate a quantitative question to a math-focused model or a programming challenge to a code-generation model. Instead of placing the entire cognitive load on one large, generalist model, the orchestrator delegates narrowed-down sub-problems to specialized intelligent tools.

Based on this concept, the researchers developed ToolOrchestra, a method that uses RL to train a small language model to act as an orchestrator. The model learns when and how to call upon other models and tools, and how to combine their outputs in multi-turn reasoning. The tools are defined in a simple JSON format, specifying their name, description and parameters.

The RL training process is guided by a reward system that produces a cost-effective and controllable agent. The reward balances three objectives: the correctness of the final answer; efficiency in cost and latency; and alignment with user preferences. For example, the system is penalized for excessive compute usage and rewarded for choosing tools that a user has marked as preferred, such as favoring an open-source model over a proprietary API for privacy reasons. To support this training, the team also developed an automatic data pipeline that generated thousands of verifiable training examples across 10 different domains.

A small model with big results

Using ToolOrchestra, the researchers trained Orchestrator, an 8-billion-parameter model based on Qwen3-8B. They evaluated its performance on three challenging benchmarks: Humanity's Last Exam (HLE), FRAMES and Tau2-Bench.
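The paper describes tools as simple JSON definitions carrying a name, a description and parameters. A hypothetical definition for a math-focused "intelligent tool" might look like the following; only the three top-level keys come from the article, and the parameter schema shown is an assumption modeled on common JSON-Schema-style tool specs.

```python
import json

# Hypothetical tool definition in the simple JSON format the paper
# describes (name, description, parameters). The nested schema fields
# are illustrative assumptions, not the paper's exact format.
math_tool = {
    "name": "math_llm",
    "description": "A math-focused model for quantitative sub-problems.",
    "parameters": {
        "type": "object",
        "properties": {
            "question": {
                "type": "string",
                "description": "The sub-problem to delegate.",
            }
        },
        "required": ["question"],
    },
}

print(json.dumps(math_tool, indent=2))
```

Because the definition is plain JSON, new tools (including other LLMs) can be registered with the orchestrator without retraining the surrounding plumbing.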
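The three-part reward can be sketched as a scalar that adds a correctness term, subtracts cost and latency penalties, and adds a bonus for using user-preferred tools. The weights and scales below are illustrative assumptions; the paper's actual reward formulation and coefficients are not reproduced here.

```python
# Toy version of the three-objective reward: correctness, efficiency
# (cost and latency), and user-preference alignment. All weights are
# illustrative assumptions, not the paper's values.

def orchestration_reward(correct: bool, cost_usd: float, latency_s: float,
                         tools_used: list, preferred: set,
                         w_cost: float = 0.5, w_latency: float = 0.1,
                         w_pref: float = 0.2) -> float:
    r = 1.0 if correct else 0.0                       # answer correctness
    r -= w_cost * cost_usd + w_latency * latency_s    # efficiency penalty
    if tools_used:                                    # preference alignment
        r += w_pref * sum(t in preferred for t in tools_used) / len(tools_used)
    return r

# A cheap, correct trajectory that uses a preferred open-source model
# scores higher than an equally correct but expensive proprietary call:
cheap = orchestration_reward(True, 0.02, 1.0, ["oss-8b"], {"oss-8b"})
pricey = orchestration_reward(True, 0.40, 3.0, ["gpt-5"], {"oss-8b"})
print(cheap > pricey)  # True
```

This kind of shaping is what steers the trained agent toward calling expensive models only on the steps that need them.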
It was compared against several baselines, including large, off-the-shelf LLMs both with and without tools.

The results showed that even powerful models struggled without tools, confirming their necessity for complex reasoning. While adding tools improved performance for large models, it often came with a steep increase in cost and latency. By contrast, the 8B Orchestrator delivered impressive results. On HLE, a benchmark of PhD-level questions, Orchestrator substantially outperformed prior methods at a fraction of the computational cost. On the Tau2-Bench function-calling test, it scheduled different tools effectively, calling a large model like GPT-5 in only about 40% of the steps and using cheaper options for the rest, while still beating an agent that used the large model for every step.

The researchers noted that the RL-trained Orchestrator adapted its strategy to new challenges, showing a "high degree of general reasoning ability." Crucially for enterprise applications, Orchestrator also generalized well to models and pricing structures it hadn't seen during training. This flexibility makes the framework suitable for businesses that rely on a mix of public, private and bespoke AI models and tools. The lower cost, higher speed and customizability make it a practical approach for building sophisticated AI agents that can scale.

As businesses look to deploy more advanced AI agents, this orchestration approach offers a path toward systems that are not only more intelligent but also more economical and controllable. (The model weights are currently available under a non-commercial license, but Nvidia has also released the training code under the permissive Apache 2.0 license.)

As the paper concludes, the future may lie in even more advanced versions of this concept: "Looking ahead, we envision more sophisticated recursive orchestrator systems to push the upper bound of intelligence [and] also to further enhance efficiency in solving increasingly complex agentic tasks."