Nvidia launched the new version of its frontier models, Nemotron 3, leaning on a model architecture that the world's most valuable company said offers more accuracy and reliability for agents. Nemotron 3 will be available in three sizes: Nemotron 3 Nano, a 30B-parameter model mainly for targeted, highly efficient tasks; Nemotron 3 Super, a 100B-parameter model for multi-agent applications with high-accuracy reasoning; and Nemotron 3 Ultra, a large reasoning engine of around 500B parameters for more complex applications.

To build the Nemotron 3 models, Nvidia said it turned to a hybrid mixture-of-experts (MoE) architecture to improve scalability and efficiency. By using this architecture, Nvidia said in a press release, its new models also offer enterprises more openness and performance when building multi-agent autonomous systems.

Kari Briski, Nvidia vice president for generative AI software, told reporters in a briefing that the company wanted to demonstrate its commitment to learning and improving from previous iterations of its models.

"We believe that we are uniquely positioned to serve a wide range of developers who want full flexibility to customize models for building specialized AI by combining that new hybrid mixture of our mixture of experts architecture with a 1 million token context length," Briski said.

Nvidia said early adopters of the Nemotron 3 models include Accenture, CrowdStrike, Cursor, Deloitte, EY, Oracle Cloud Infrastructure, Palantir, Perplexity, ServiceNow, Siemens and Zoom.

Breakthrough architectures

Nvidia has been using the hybrid Mamba-Transformer mixture-of-experts architecture for many of its models, including Nemotron-Nano-9B-v2. The architecture is based on research from Carnegie Mellon University and Princeton, and weaves selective state-space models into the Transformer stack to handle long stretches of information while maintaining state, which keeps compute costs down even at long context lengths. Nvidia noted the design "achieves up to 4x higher token throughput" compared to Nemotron 2 Nano and can significantly lower inference costs by cutting reasoning token generation by up to 60%.

"We really need to be able to bring that efficiency up and the cost per token down. And you can do it through a number of ways, but we're really doing it through the innovations of that model architecture," Briski said. "The hybrid Mamba transformer architecture runs several times faster with less memory, because it avoids these huge attention maps and key value caches for every single token."

Nvidia also introduced an additional innovation for the Nemotron 3 Super and Ultra models. For these, Briski said, Nvidia deployed "a breakthrough called latent MoE."

"That's all these experts that are in your model share a common core and keep only a small part private. It's kind of like chefs sharing one big kitchen, but they need to get their own spice rack," Briski added.

Nvidia is not the only company that employs this kind of architecture to build models. AI21 Labs uses it for its Jamba models, most recently in its Jamba Reasoning 3B model.

The Nemotron 3 models also benefited from extended reinforcement learning. The larger models, Super and Ultra, were trained with the company's 4-bit NVFP4 format, which lets them train on existing infrastructure without compromising accuracy. Benchmark testing from Artificial Analysis placed the Nemotron models highly among models of similar size.
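To make the state-space idea above concrete: unlike attention, which keeps a key/value cache that grows with every token, a selective state-space layer carries a fixed-size hidden state forward through the sequence. The toy recurrence below is a minimal illustration of that pattern; the shapes, the gating rule, and the parameter names are invented for the example and are not Nemotron's actual parameterization.

```python
import numpy as np

# Toy diagonal state-space recurrence: the layer carries a fixed-size
# hidden state h instead of a key/value cache that grows with sequence
# length. All shapes and the selection rule are illustrative only.

def selective_ssm(x, A, B, C, W_select):
    """x: (seq_len, d_in) inputs; returns (seq_len, d_out) outputs."""
    d_state = A.shape[0]
    h = np.zeros(d_state)  # constant-size state: O(1) memory per step
    outputs = []
    for x_t in x:
        gate = 1.0 / (1.0 + np.exp(-(W_select @ x_t)))   # input-dependent "selection"
        h = gate * (A * h) + (1.0 - gate) * (B @ x_t)    # update the recurrent state
        outputs.append(C @ h)                            # read out from the state
    return np.stack(outputs)

rng = np.random.default_rng(0)
seq_len, d_in, d_state, d_out = 16, 8, 4, 8
x = rng.normal(size=(seq_len, d_in))
A = rng.uniform(0.5, 0.99, size=d_state)       # per-channel decay
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state)) * 0.1
W_select = rng.normal(size=(d_state, d_in)) * 0.1
y = selective_ssm(x, A, B, C, W_select)
print(y.shape)  # (16, 8): memory use did not grow with sequence length
```

However long the sequence runs, the layer only ever stores the vector h, which is the property Briski points to when contrasting the hybrid design with per-token attention caches.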
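Briski's "shared kitchen" description of latent MoE suggests experts that reuse one large shared transformation and add only small private parameters on top. The sketch below is a hedged reading of that idea, not Nvidia's actual design: the router, the adapter shapes, and the top-k weighting are all assumptions made for illustration.

```python
import numpy as np

# Hedged sketch of "latent MoE" as described in the briefing: every
# expert shares one large core projection (the kitchen) and owns only a
# small private adapter (the spice rack). Illustrative shapes only.

rng = np.random.default_rng(1)
d_model, d_hidden, d_private, n_experts = 16, 64, 4, 8

W_core = rng.normal(size=(d_hidden, d_model)) * 0.05                 # shared core
priv_down = rng.normal(size=(n_experts, d_private, d_hidden)) * 0.05 # per-expert
priv_up = rng.normal(size=(n_experts, d_hidden, d_private)) * 0.05   # per-expert
W_out = rng.normal(size=(d_model, d_hidden)) * 0.05
W_router = rng.normal(size=(n_experts, d_model)) * 0.05

def latent_moe(x, top_k=2):
    """Route a token x (d_model,) to top_k experts; all reuse W_core."""
    h = np.maximum(W_core @ x, 0.0)            # shared core, computed once
    scores = W_router @ x
    top = np.argsort(scores)[-top_k:]          # pick the top-k experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    out = np.zeros(d_hidden)
    for w, e in zip(weights, top):
        out += w * (h + priv_up[e] @ (priv_down[e] @ h))  # small private delta
    return W_out @ out

y = latent_moe(rng.normal(size=d_model))
print(y.shape)  # (16,); expert-specific parameters are tiny next to the core
```

The design point the sketch illustrates is parameter counting: each expert here owns only 2 * d_private * d_hidden private weights, so adding experts barely grows the model compared with a conventional MoE where every expert carries a full feed-forward block.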
New environments for models to 'work out'

As part of the Nemotron 3 launch, Nvidia will also open up its research, releasing papers and sample prompts, offering open datasets where people can use and inspect pre-training tokens and post-training samples, and, most importantly, launching a new NeMo Gym where customers can let their models and agents "work out."

NeMo Gym is a reinforcement learning lab where users can run their models in simulated environments to test their post-training performance. AWS announced a similar tool through its Nova Forge platform, targeted at enterprises that want to test their newly created distilled or smaller models.

Briski said the samples of post-training data Nvidia plans to release "are orders of magnitude larger than any available post-training data set and are also very permissive and open."

Nvidia said it is releasing more information about how it trains its models because developers who seek highly intelligent and performant open models want to understand how to guide them if needed.

"Model developers today hit this tough trifecta. They need to find models that are ultra open, that are extremely intelligent and are highly efficient," she said. "Most open models force developers into painful trade-offs between efficiencies like token costs, latency, and throughput."

She said developers want to know how a model was trained, where the training data came from and how they can evaluate it.
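For readers unfamiliar with the reinforcement-learning "gym" pattern that tools like NeMo Gym and Nova Forge follow, the sketch below shows the basic loop: an environment hands the agent an observation, the agent acts, and the environment scores the result over many episodes. The toy environment, agent, and scoring here are invented for illustration and do not reflect NeMo Gym's actual interface.

```python
import random

# Generic illustration of the RL "gym" pattern: an agent acts in a
# simulated environment and is scored on the outcome. This toy task
# and interface are invented; they are not NeMo Gym's API.

class ToyTaskEnv:
    """Episode ends when the agent guesses the hidden digit, or after 10 tries."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        self.target = self.rng.randint(0, 9)
        self.steps = 0
        return "guess a digit"   # the observation (a prompt, in the LLM case)

    def step(self, action):
        self.steps += 1
        done = (action == self.target) or (self.steps >= 10)
        reward = 1.0 if action == self.target else 0.0
        return "guess again", reward, done

def evaluate(agent_policy, episodes=100):
    """Roll the agent through many episodes and report the mean reward."""
    env, total = ToyTaskEnv(), 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            obs, reward, done = env.step(agent_policy(obs))
        total += reward
    return total / episodes

# A random baseline scores about 1 - 0.9**10, roughly 0.65, in expectation.
print(evaluate(lambda obs: random.randint(0, 9)))
```

Swapping the random lambda for a model-backed policy is the whole point of this pattern: the same evaluate loop then measures how much post-training improved the agent on the task.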