Earlier this week, Nvidia surprise-announced their new Vera Rubin architecture (no relation to the recently unveiled telescope) at the Consumer Electronics Show in Las Vegas. The new platform, set to reach customers later this year, is advertised to offer a ten-fold reduction in inference costs and a four-fold reduction in how many GPUs it would take to train certain models, as compared to Nvidia’s Blackwell architecture.
The usual suspect for improved performance is the GPU. Indeed, the new Rubin GPU boasts 50 quadrillion 4-bit floating-point operations per second (50 petaFLOPS), as compared to 10 petaFLOPS on Blackwell, at least for transformer-based inference workloads like large language models.
However, focusing on just the GPU misses the bigger picture. There are a total of six new chips in the Vera-Rubin-based computers: the Vera CPU, the Rubin GPU, and four distinct networking chips. To achieve performance advantages, the components have to work in concert, says Gilad Shainer, senior vice president of networking at Nvidia.
“The same unit connected in a different way will deliver a completely different level of performance,” Shainer says. “That’s why we call it extreme co-design.”
Expanded “in-network compute”
AI workloads, both training and inference, run on large numbers of GPUs simultaneously. “Two years back, inferencing was mainly run on a single GPU, a single box, a single server,” Shainer says. “Right now, inferencing is becoming distributed, and it’s not just in a rack. It’s going to go across racks.”
To accommodate these hugely distributed tasks, as many GPUs as possible need to effectively work as one. This is the aim of the so-called scale-up network: the connection of GPUs within a single rack. Nvidia handles this connection with their NVLink networking chip. The new line includes the NVLink6 switch, with double the bandwidth of the previous version (3,600 gigabytes per second for GPU-to-GPU connections, as compared to 1,800 GB/s for the NVLink5 switch).
In addition to the doubled bandwidth, the scale-up chips also include twice as many SerDes (serializer/deserializers, which allow data to be sent across fewer wires) and support an expanded set of calculations that can be done within the network itself.
“The scale-up network is not really the network itself,” Shainer says. “It’s computing infrastructure, and some of the computing operations are done on the network…on the switch.”
The rationale for offloading some operations from the GPUs to the network is two-fold. First, it allows some tasks to be done only once, rather than requiring every GPU to perform them. A common example of this is the all-reduce operation in AI training. During training, each GPU computes a mathematical operation called a gradient on its own batch of data. To train the model correctly, all the GPUs need to know the average gradient computed across all batches. Rather than each GPU sending its gradient to every other GPU, with every one of them computing the average, it saves computational time and power for that operation to happen only once, within the network.
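To make the all-reduce step concrete, here is a minimal plain-Python sketch of the arithmetic involved. The function name, data values, and structure are purely illustrative (they are not Nvidia's APIs or switch firmware); the point is simply that the average needs to be computed once and then shared, rather than computed redundantly by every GPU.

```python
# Illustrative sketch of gradient averaging (all-reduce), in plain Python.
# In a real system this reduction would happen across GPUs or, as described
# above, inside the network switch itself.

from typing import List

def all_reduce_average(local_gradients: List[List[float]]) -> List[float]:
    """Compute the element-wise average of every worker's gradient once,
    then return that single result for all workers to share."""
    num_workers = len(local_gradients)
    length = len(local_gradients[0])
    return [
        sum(grad[i] for grad in local_gradients) / num_workers
        for i in range(length)
    ]

# Each "GPU" holds the gradient it computed on its own batch of data.
gradients_per_gpu = [
    [0.10, -0.20, 0.30],  # GPU 0
    [0.20, -0.10, 0.10],  # GPU 1
    [0.00, -0.30, 0.20],  # GPU 2
]

shared_average = all_reduce_average(gradients_per_gpu)
print(shared_average)  # every GPU applies this same averaged gradient
```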