
How Nvidia's $20 billion Groq 3 LPU deal reshapes the Nvidia Vera Rubin Platform — Samsung 4nm process serves as bedrock for SRAM-based AI accelerator chip

Why This Matters

Nvidia's $20 billion deal with Groq introduces the Groq 3 LPU, a non-GPU AI inference accelerator that significantly boosts throughput and efficiency in Nvidia's Vera Rubin platform. Manufactured on Samsung's 4nm process, the SRAM-based chip marks a strategic shift toward specialized silicon for AI workloads, displacing one of Nvidia's own planned chips and improving performance for large-scale AI models. Integrating Groq's LPUs into Nvidia's heterogeneous architecture points to an industry-wide move toward more efficient, higher-performance AI inference.

Key Takeaways

Nvidia unveiled the Groq 3 language processing unit at GTC 2026 in San Jose on Monday, marking the first chip to emerge from its $20 billion licensing and talent deal with AI inference startup Groq, which was struck on Christmas Eve last year. The SRAM-based inference accelerator slots into the Vera Rubin platform as a dedicated decode-phase co-processor, and Nvidia plans to ship it in Q3 2026, manufactured by Samsung on a 4nm process. It is the company's first rack-scale product built around non-GPU silicon — and its arrival has already displaced a homegrown Nvidia chip from the roadmap.

The LP30 chip at the heart of the Groq 3 LPX rack carries 512 MB of on-chip SRAM per die, delivering 150 TB/s of memory bandwidth. That figure dwarfs the 22 TB/s available from the 288 GB of HBM4 on each Rubin GPU. A full LPX rack houses 256 LPUs for a total of 128 GB of SRAM and 40 PB/s of aggregate bandwidth. Nvidia claims the LPX rack, paired with a Vera Rubin NVL72, delivers 35 times higher throughput per megawatt than a Blackwell NVL72 alone for trillion-parameter models, at a target price point of $45 per million tokens.
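The rack-level totals follow directly from the per-die numbers; here is a quick back-of-the-envelope check in Python (the per-die figures are Nvidia's, the rest is arithmetic):

```python
# Sanity-check of the LPX rack totals quoted above. Only the per-die
# numbers come from Nvidia's announcement; the rest is multiplication.

lpus_per_rack = 256
sram_per_die_mb = 512        # on-chip SRAM per LP30 die
bw_per_die_tbs = 150         # SRAM bandwidth per die, TB/s

total_sram_gb = lpus_per_rack * sram_per_die_mb / 1024
total_bw_pbs = lpus_per_rack * bw_per_die_tbs / 1000

print(f"rack SRAM:      {total_sram_gb:.0f} GB")   # 128 GB
print(f"rack bandwidth: {total_bw_pbs:.1f} PB/s")  # ~38.4 PB/s, evidently rounded to 40 in the quoted figure
```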

Groq 3 and Vera Rubin

Nvidia detailed its entire seven-chip Rubin SuperPOD strategy at GTC 2026. (Image credit: Nvidia)

Rubin GPUs handle the compute-intensive prefill phase of a query, processing long input contexts, while Groq LPUs take over the decode phase, generating output tokens at low latency. Nvidia's Dynamo orchestration platform manages the split across heterogeneous hardware, distributing workloads based on batch size and parallelism requirements.
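Nvidia hasn't published Dynamo's scheduling internals, so the sketch below is purely illustrative: hypothetical names, not Dynamo's API, just the disaggregated prefill/decode pattern described above.

```python
# Illustrative sketch of disaggregated prefill/decode serving, the pattern
# Dynamo implements across Rubin GPUs and Groq LPUs. All names here are
# hypothetical; this is not Dynamo's actual API.

from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int   # long contexts make prefill compute-bound
    max_new_tokens: int  # decode is latency- and bandwidth-bound

def route(request: Request) -> list[str]:
    """Split one query into a prefill stage and a decode stage."""
    plan = []
    # Prefill: process the whole prompt in parallel. FLOPs-heavy, so it
    # goes to the Rubin GPUs, which produce the KV cache for the sequence.
    plan.append(f"prefill[{request.prompt_tokens} tok] -> Rubin GPU pool")
    # Decode: generate output tokens one at a time. Dominated by memory
    # reads, so it goes to the SRAM-based LPUs for low latency.
    plan.append(f"decode[{request.max_new_tokens} tok] -> Groq LPU pool")
    return plan

for step in route(Request(prompt_tokens=32_000, max_new_tokens=512)):
    print(step)
```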


The original, pre-Nvidia Groq LPU design used a fixed Very Long Instruction Word (VLIW) pipeline and large on-chip SRAM pools, with the compiler pre-scheduling the entire execution path at compile time. That meant deterministic latency, with no cache misses or stalls. The chips demonstrated raw single-user token rates in the thousands per second, but the architecture's weakness was always capacity: at 230 MB of SRAM per chip in prior generations, fitting even medium-sized models required high chip counts, and the architecture was originally designed for convolutional neural networks.
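The determinism follows from the scheduling model: if every operation's issue cycle is fixed at compile time, end-to-end latency is known before the program runs. A toy illustration of the idea (invented op names and cycle costs, nothing to do with Groq's actual ISA):

```python
# Toy illustration of compile-time static scheduling. The "compiler"
# assigns every op a fixed start cycle ahead of time, so runtime latency
# is fully deterministic: no caches, no stalls, no dynamic dispatch.
# Op names and cycle costs are invented for the example.

OP_CYCLES = {"load_weights": 4, "matmul": 8, "activation": 2, "store": 3}

def compile_schedule(ops):
    """Assign each op a fixed start cycle; total latency is known up front."""
    schedule, cycle = [], 0
    for op in ops:
        schedule.append((cycle, op))
        cycle += OP_CYCLES[op]
    return schedule, cycle  # deterministic end-to-end latency

layer = ["load_weights", "matmul", "activation", "store"]
schedule, total = compile_schedule(layer)
for start, op in schedule:
    print(f"cycle {start:3d}: {op}")
print(f"deterministic latency: {total} cycles")
```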

The Groq LP30 addresses some of these limitations with 512 MB of SRAM per die and 1.23 PFLOPS of FP8 compute. Samsung has ramped production from roughly 9,000 wafers to about 15,000 as output shifts from samples to commercial manufacturing. At GTC, AWS announced that it will deploy Groq 3 LPUs alongside more than one million Nvidia GPUs as part of an expanded partnership.
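The capacity math makes the improvement concrete. Taking, purely for illustration, a 70-billion-parameter model at FP8 (one byte per weight), the minimum number of dies needed just to hold the weights in SRAM drops by more than half; only the 230 MB and 512 MB per-die figures come from the article:

```python
# Rough capacity math behind the "high chip counts" problem and how the
# LP30's larger SRAM helps. The 70B FP8 model is a hypothetical example;
# only the 230 MB and 512 MB per-die figures come from the article.

import math

def dies_needed(params_billion: float, bytes_per_param: int, sram_mb: int) -> int:
    """Minimum dies just to hold the weights in on-chip SRAM (ignores
    KV cache, activations, and any replication for parallelism)."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return math.ceil(weight_bytes / (sram_mb * 1024**2))

print(dies_needed(70, 1, 230))   # prior-gen LPU: ~291 dies for weights alone
print(dies_needed(70, 1, 512))   # LP30:          ~131 dies
```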

Beyond the LP30, a future LP35 will add NVFP4 support, aligning with the Rubin Ultra generation, and an LP40 is planned for the Feynman architecture cycle after that.

Rubin CPX axed?
