Following Nvidia's GTC 2026 keynote, where CEO Jensen Huang laid out the company's Vera Rubin architecture and the Groq 3 LPU acquisition, Nvidia VP of Hyperscale and HPC Ian Buck sat down with press for a Q&A session in San Jose.
Buck addressed CPX's delay, the LPU-GPU decode architecture, and the Vera CPU's role in the AI data center, and took questions on the Intel NVLink Fusion partnership. This is a full transcript of a session attended by Tom's Hardware at GTC 2026; as such, the transcript is occasionally unclear, and we have noted those moments within the copy. Before diving into the transcript, it's worth refamiliarizing yourself with Jensen Huang's keynote from last week, which we've linked below.
NVIDIA GTC Keynote 2026 - YouTube
CPX delay and LPU decode architecture
Ian Buck: As part of bringing the LPU to market this year with Vera Rubin... we've pulled CPX. It's still a good idea, but in order to dedicate our focus on... optimizing the decode with LPU this year. So we'll be thinking about CPX more in the next generation [and] we're going to execute on decode with LPU now, this year.
A couple other things I wanted to touch on. I also get a lot of questions about how we're doing this. How does the LPU work? How does it work with the GPU? Jensen went over it briefly. This might be more technical, but it's an important point.
The way we do the decode is with a Groq 3 LPU LPX rack. Here we have 256 LPU chips combined with a Vera Rubin NVL72. We're going to do the decode using Dynamo. We've combined the two teams, so Groq's software team has joined our Dynamo team.
We now do not only the disaggregation across separate GPUs for prefill and decode; the decode itself is also split between the LPU and GPU. That's what makes the extremely fast token generation economical. We can focus and run the computations that benefit from the fast SRAM of the LPU over here in one layer, and literally the next layer, we can send the intermediate activation state over to the GPUs to do all the attention math, all the softmax, all the routing, all the KV calculations, so that only the LPUs need to have copies of the weights. All the per-query state, all the KV [cache] state, which can get quite large, operates and stays in the GPUs' HBM.
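To make the split Buck describes concrete, here is a minimal, purely illustrative sketch in Python (NumPy standing in for both chips). It is not Nvidia's or Groq's implementation: the `LPU` class holds the static weight matrices (standing in for fast SRAM) and does the weight-bound matmuls, while the `GPU` class holds the growing per-query KV cache (standing in for HBM) and does the attention math, with activations passed between the two every layer. All class and function names are our own assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size

class LPU:
    """Holds static weight copies (stand-in for fast SRAM) and runs the matmuls."""
    def __init__(self, num_layers):
        self.weights = [rng.standard_normal((D, D)) / np.sqrt(D)
                        for _ in range(num_layers)]

    def matmul(self, layer, x):
        # Weight-heavy work stays on the LPU side.
        return x @ self.weights[layer]

class GPU:
    """Holds the growing per-query KV state (stand-in for HBM) and runs attention."""
    def __init__(self):
        self.kv_cache = {}  # layer -> list of cached token states, grows each step

    def attention(self, layer, q):
        cache = self.kv_cache.setdefault(layer, [])
        cache.append(q)                       # cache this token's state
        keys = np.stack(cache)                # (T, D) over all decoded tokens
        scores = keys @ q / np.sqrt(len(q))   # dot-product attention scores
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # softmax stays on the "GPU"
        return probs @ keys                   # weighted sum over the cache

def decode_step(lpu, gpu, x):
    # One token of decode: each layer alternates between the two processors,
    # shipping the intermediate activations back and forth, as Buck describes.
    for layer in range(len(lpu.weights)):
        x = lpu.matmul(layer, x)
        x = gpu.attention(layer, x)
    return x
```

Note the division of state this forces: the `LPU` never stores anything per-query, and the `GPU` never stores weights, which is the property Buck says makes the fast-SRAM side economical.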
Of course, both processors can do both things. The LPUs can do the attention math. Obviously the GPUs can do the [...] as well. So you can optimize for resiliency.