
Nvidia's Vera Rubin platform in depth — Inside Nvidia's most complex AI and HPC platform to date


As Nvidia ships millions of Grace CPUs and Blackwell AI GPUs to data centers worldwide, the company is hard at work bringing up its next-generation AI and HPC platform, Vera Rubin, which is expected to set a new standard for performance and efficiency. Nvidia's Vera Rubin comprises not one or two, but nine separate processors, each tailored for a particular workload, creating one of the most complex data center platforms ever.

While Nvidia will disclose more details about Vera Rubin over the coming year before the platform officially launches in late 2026, the company has already revealed a fair few details, so let's recap what we know so far.

At a glance

On the hardware side, Nvidia's Vera Rubin platform is its next-generation rack-scale AI compute architecture built around a tightly integrated set of components. These include the following: an 88-core Vera CPU; a Rubin GPU with 288 GB of HBM4 memory; a Rubin CPX GPU with 128 GB of GDDR7; an NVLink 6.0 switch ASIC for scale-up connectivity within the rack; a BlueField-4 DPU with an integrated SSD for key-value cache storage; Spectrum-6 Photonics Ethernet and Quantum-CX9 1.6 Tb/s Photonics InfiniBand NICs; as well as Spectrum-X Photonics Ethernet and Quantum-CX9 Photonics InfiniBand switching silicon for scale-out connectivity.

(Image credit: Nvidia/YouTube)

A full NVL144 rack integrates 144 Rubin GPUs (in 72 packages) with 20,736 GB (roughly 20 TB) of HBM4 memory and 36 Vera CPUs to deliver up to 3.6 NVFP4 ExaFLOPS for inference and up to 1.2 FP8 ExaFLOPS for training. In contrast, NVL144 CPX achieves almost 8 NVFP4 ExaFLOPS for inference using Rubin CPX accelerators, providing even greater compute density.
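Those rack-level figures follow directly from the per-package numbers. As a rough sanity check, assuming 72 dual-die Rubin packages per NVL144 rack as described above:

```python
# Rough sanity check of the NVL144 rack-level figures from the per-package
# numbers quoted above: 288 GB of HBM4 per Rubin package, 72 dual-die
# packages (144 GPU dies) and 36 Vera CPUs per rack.

packages_per_rack = 72
hbm4_per_package_gb = 288

total_hbm4_gb = packages_per_rack * hbm4_per_package_gb
print(f"Total HBM4: {total_hbm4_gb:,} GB (~{total_hbm4_gb / 1024:.1f} TB)")
# -> Total HBM4: 20,736 GB (~20.3 TB)

# Implied per-package throughput, working back from the quoted rack totals.
rack_nvfp4_exaflops = 3.6   # inference, NVFP4
rack_fp8_exaflops = 1.2     # training, FP8
print(f"NVFP4 per package: {rack_nvfp4_exaflops * 1000 / packages_per_rack:.0f} PFLOPS")
print(f"FP8 per package:   {rack_fp8_exaflops * 1000 / packages_per_rack:.1f} PFLOPS")
# -> roughly 50 PFLOPS of NVFP4 and ~16.7 PFLOPS of FP8 per dual-die package
```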

On the software side, the Rubin generation is optimized for FP4/FP6 precision, million-token context inference, and multi-modal generative workloads. The CPX systems will come with Nvidia's Dynamo inference orchestrator built atop CUDA 13, which is designed to intelligently manage and split inference workloads across different types of GPUs in a disaggregated system.
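Nvidia has not published Dynamo's scheduling internals in this context, but the disaggregation idea itself is simple to sketch: compute-heavy prefill runs on the GDDR7-equipped CPX parts, while bandwidth-hungry decode stays on the HBM4 Rubin GPUs. The snippet below is a minimal, hypothetical illustration of that split; the class and pool names are ours, not Dynamo's API.

```python
# Minimal sketch of disaggregated inference as described above: compute-bound
# prefill is routed to CPX-class GPUs, bandwidth-bound decode to HBM-class GPUs.
# This is NOT Dynamo's actual API; the class and pool names are illustrative.

from dataclasses import dataclass

@dataclass
class InferenceRequest:
    prompt_tokens: int     # input context length -> prefill work
    max_new_tokens: int    # tokens to generate   -> decode work

class DisaggregatedScheduler:
    def __init__(self, prefill_pool, decode_pool):
        self.prefill_pool = prefill_pool   # e.g. Rubin CPX (GDDR7, compute-dense)
        self.decode_pool = decode_pool     # e.g. Rubin (HBM4, bandwidth-rich)

    def schedule(self, req: InferenceRequest):
        # Phase 1: run the long-context prefill on the least-loaded compute GPU,
        # producing the KV cache for the prompt.
        prefill_gpu = min(self.prefill_pool, key=lambda g: g["queued_tokens"])
        prefill_gpu["queued_tokens"] += req.prompt_tokens

        # Phase 2: hand the KV cache to a bandwidth-rich GPU for the
        # token-by-token autoregressive decode loop.
        decode_gpu = min(self.decode_pool, key=lambda g: g["queued_tokens"])
        decode_gpu["queued_tokens"] += req.max_new_tokens

        return prefill_gpu["name"], decode_gpu["name"]

# Two tiny illustrative pools; a real system would track far richer state.
cpx_pool = [{"name": "cpx0", "queued_tokens": 0}, {"name": "cpx1", "queued_tokens": 0}]
hbm_pool = [{"name": "rubin0", "queued_tokens": 0}, {"name": "rubin1", "queued_tokens": 0}]
scheduler = DisaggregatedScheduler(cpx_pool, hbm_pool)
print(scheduler.schedule(InferenceRequest(prompt_tokens=100_000, max_new_tokens=2_000)))
```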

Additionally, Nvidia's Smart Router and GPU Planner will dynamically balance prefill and decode workloads across Mixture-of-Experts (MoE) replicas to improve utilization and response time. Also, Nvidia's Interconnect Extension Layer (NIXL) enables zero-copy data transfers between GPUs and NICs through InfiniBand GPUDirect Async (IBGDA) to reduce latency and CPU overhead. Meanwhile, NVMe key-value cache offload is said to achieve 50% – 60% hit rates, enabling multi-turn conversational context to persist efficiently. Finally, the new NCCL 2.24 library is expected to reduce small-message latency by 4x, enabling the scaling of trillion-parameter agentic AI models with much faster inter-GPU communication.
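To see why that hit rate matters, consider a toy two-tier cache in which KV blocks for past conversation turns spill from HBM to an NVMe tier and are promoted back on reuse; every hit is a prefill pass the GPU does not have to repeat. This is purely illustrative: the class name and LRU eviction policy below are assumptions, not Nvidia's implementation.

```python
# Toy two-tier KV cache illustrating the NVMe offload idea: KV blocks for past
# conversation turns spill from HBM to an SSD tier and are promoted back on a
# hit, so a returning multi-turn session skips a full prefill pass. The class
# name and LRU policy here are assumptions, not Nvidia's implementation.

from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity_sessions: int):
        self.hbm = OrderedDict()   # hot KV blocks resident in GPU memory (LRU order)
        self.nvme = {}             # cold KV blocks offloaded to the SSD tier
        self.capacity = hbm_capacity_sessions
        self.hits = 0
        self.lookups = 0

    def get(self, session_id: str):
        """True on a hit in either tier; False means prefill must be recomputed."""
        self.lookups += 1
        if session_id in self.hbm:
            self.hbm.move_to_end(session_id)                 # refresh LRU position
            self.hits += 1
            return True
        if session_id in self.nvme:
            self.put(session_id, self.nvme.pop(session_id))  # promote back to HBM
            self.hits += 1
            return True
        return False

    def put(self, session_id: str, kv_blocks):
        self.hbm[session_id] = kv_blocks
        self.hbm.move_to_end(session_id)
        if len(self.hbm) > self.capacity:                    # evict coldest to NVMe
            victim, blocks = self.hbm.popitem(last=False)
            self.nvme[victim] = blocks

    @property
    def hit_rate(self):
        return self.hits / max(self.lookups, 1)
```

At the quoted 50% to 60% hit rates, roughly every other multi-turn request would reuse its cached context instead of re-running prefill.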

Truth be told, the software features above are not specific to the Vera Rubin platform, but Rubin-class systems benefit from them the most, as the platform was designed explicitly to exploit them at scale. So what is so special about the Vera Rubin platform? Let's dig a little deeper.

The Vera CPU

... continue reading