Google announced its eighth-generation Tensor Processing Units at Cloud Next on April 22, shipping two distinct chip designs for the first time in the TPU program's decade-long history. The two chips — TPU 8t and TPU 8i — serve different workloads: TPU 8t targets large-scale model training, while TPU 8i is built for low-latency inference and reasoning.
The split extends to the supply chain as well. MediaTek joined Broadcom as a silicon design partner for the eighth-gen program back in December, ending the exclusive role Broadcom had held in TPU development since 2015. Both chips are fabricated on TSMC's N3 process family with HBM3E memory and will be available to Google Cloud customers later this year.
Optionality for customers
In terms of raw specs, TPU 8 doesn't close the gap with Nvidia or AMD. According to Google's own technical deep dive, TPU 8t delivers 12.6 FP4 PFLOPs with 216 GB of HBM3E running at 6,528 GB/s, while TPU 8i offers 10.1 FP4 PFLOPs, 288 GB of HBM3E at 8,601 GB/s, and 384 MB of on-chip SRAM. In comparison, Nvidia's Vera Rubin R200 is rated at 35 FP4 PFLOPs for training with 288 GB of HBM4 at 22 TB/s, and AMD's MI455X reaches 40 FP4 PFLOPs with 432 GB of HBM4. That puts the gap at roughly 3:1 in raw compute per socket.
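Using the peak figures above, the per-socket compute ratio works out as follows (a quick sanity check on the stated 3:1 gap, nothing more):

```python
# Peak FP4 compute per socket, in PFLOPs, from the figures quoted above.
chips = {
    "TPU 8t": 12.6,
    "Vera Rubin R200": 35.0,
    "MI455X": 40.0,
}

baseline = chips["TPU 8t"]
for name, pflops in chips.items():
    # Ratio of each chip's peak FP4 throughput to TPU 8t's.
    print(f"{name}: {pflops} PFLOPs, {pflops / baseline:.1f}x TPU 8t")
```

The R200 comes out at about 2.8x and the MI455X at about 3.2x, which is where the roughly 3:1 figure lands.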
Then there's the choice of HBM3E over HBM4, which appears to be a deliberate cost-and-yield trade-off. TPU 8t carries 12.5% more memory capacity than the previous-gen Ironwood TPU but delivers 11.5% less bandwidth, running slower memory to improve yield and bring down cost per chip, per analysis from The Next Platform. That looks like an odd strategy on the face of it, but rather than taking on Nvidia in raw performance, Google appears to be creating options for external customers that want alternatives.
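Those percentages also pin down the implied Ironwood baseline, assuming they were computed against Ironwood's own capacity and bandwidth (a back-of-the-envelope inversion, not published figures):

```python
# Invert the stated deltas to recover the Ironwood specs they imply.
tpu8t_hbm_gb = 216     # TPU 8t HBM capacity
tpu8t_bw_gbs = 6_528   # TPU 8t HBM bandwidth

# "12.5% more capacity" and "11.5% less bandwidth" vs. Ironwood:
ironwood_hbm_gb = tpu8t_hbm_gb / 1.125
ironwood_bw_gbs = tpu8t_bw_gbs / 0.885

print(f"Implied Ironwood HBM:       {ironwood_hbm_gb:.0f} GB")
print(f"Implied Ironwood bandwidth: {ironwood_bw_gbs:.0f} GB/s")
```

The inversion gives 192 GB and roughly 7.4 TB/s for Ironwood, consistent with the percentages quoted in the analysis.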
A TPU 8t superpod packs 9,600 chips into a single cluster with two petabytes of shared HBM, connected by a proprietary inter-chip interconnect running at double the previous generation's bandwidth. Google claims 121 FP4 ExaFLOPs from a single superpod, with the new Virgo Network fabric tying up to 134,000 TPU 8t chips into a single non-blocking data center fabric with 47 PB/s of bisection bandwidth, extending past 1 million chips across multiple sites.
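The pod-level figures follow directly from the per-chip specs; a quick back-of-the-envelope check (decimal units assumed throughout):

```python
# Aggregate superpod numbers implied by the per-chip TPU 8t specs.
chips_per_pod = 9_600
hbm_per_chip_gb = 216    # TPU 8t HBM capacity
pflops_per_chip = 12.6   # TPU 8t peak FP4

total_hbm_pb = chips_per_pod * hbm_per_chip_gb / 1_000_000  # GB -> PB
total_exaflops = chips_per_pod * pflops_per_chip / 1_000    # PFLOPs -> EFLOPs

print(f"Shared HBM: {total_hbm_pb:.2f} PB")       # ~2 PB, as claimed
print(f"Peak FP4:   {total_exaflops:.0f} EFLOPs") # ~121 EFLOPs, as claimed
```

Both of Google's headline pod numbers fall straight out of the per-chip specs, so they are simple peak aggregates rather than measured throughput.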
So, yes, while individual Nvidia GPUs are faster, Google holds an advantage in pod-level throughput at that mass scale: training workloads consume thousands of accelerators, not one, and Nvidia's current-gen GPUs top out at 576 accelerators in a single NVLink domain.
Interestingly, Google also announced Vera Rubin NVL72 instances running over the same Virgo Network fabric at Cloud Next, so TPUs are clearly not intended to act as a direct replacement for Nvidia silicon.
Google TPU 8 specs

| | TPU 8t | TPU 8i |
| --- | --- | --- |
| Workload | Large-scale pre-training | Sampling, serving, and reasoning |
| Network topology | 3D Torus | Boardfly |
| Specialized chip features | SparseCore (embeddings) & LLM Decoder Engine | CAE (Collectives Acceleration Engine) |
| HBM capacity | 216 GB | 288 GB |
| On-chip SRAM | 128 MB | 384 MB |
| Peak FP4 PFLOPs | 12.6 | 10.1 |
| HBM bandwidth | 6,528 GB/s | 8,601 GB/s |
| Host CPU | Arm Axion | Arm Axion |