Touching the Elephant - TPUs
Understanding the Tensor Processing Unit
Reed
December 1, 2025
Something New
There is mythological reverence for Google’s Tensor Processing Unit. While the world presently watches NVIDIA’s gravity drag more companies into its orbit, there sits Google, imperial and singular. Lots of companies participate in the “Cambrian-style explosion of new-interesting accelerators”[14] – Groq, Amazon, and Tenstorrent come to mind – but the TPU is the original existence proof. NVIDIA deserves credit for the reemergence of deep learning, but the GPU wasn’t designed with deep learning in mind. What’s strange is that the TPU isn’t a secret. This research is indebted to Google’s public chest-thumping, yet the devices themselves have long been exclusive to Google’s datacenters. That is over a decade of work on a hardware system sequestered behind their walls. That the TPU is so well documented yet without a true counterpart creates a strange asymmetry. Google is well positioned in the AI race because of a decision made over a decade ago to build its own hardware accelerator. It is because of the TPU.
On the back of DistBelief, Google had gotten neural networks running at scale. In 2013, however, they realized that they would need to double their datacenter capacity to meet the growing demand for these new services. “Even if this was economically reasonable, it would still take significant time, as it would involve pouring concrete, striking arrangements for windmill farm contracts, ordering and installing lots of computers, etc.” [14] The race against the clock began, and 15 months later the TPU was born. Fast forward to April of this year, when Sundar Pichai announced the 7th-generation TPU, Ironwood, at Google Cloud Next. The headline figures were eye-popping: 9,216 chips in a pod, 42.5 exaflops, 10 MW [21]. In 12 years the TPU went from a research project to a goliath rack-scale system.
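As a quick sanity check on those headline figures, here is a back-of-envelope sketch that assumes nothing beyond the pod-level numbers quoted above: dividing pod compute and pod power by the chip count gives a rough per-chip picture.

```python
# Back-of-envelope arithmetic using only the pod-level figures quoted above [21].
# These are headline announcement numbers, not measured throughput on a real workload.
chips_per_pod = 9_216
pod_exaflops = 42.5    # 1 exaflop = 1e18 FLOP/s
pod_megawatts = 10.0

per_chip_pflops = pod_exaflops * 1e18 / chips_per_pod / 1e15
watts_per_chip = pod_megawatts * 1e6 / chips_per_pod  # crude: total pod power spread over chips

print(f"~{per_chip_pflops:.1f} PFLOP/s per chip")        # ~4.6 PFLOP/s
print(f"~{watts_per_chip:.0f} W of pod power per chip")  # ~1085 W, including non-chip overheads
```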
Perhaps reverence is warranted. The development of the TPU is set against the backdrop of a changing hardware scaling landscape. It used to be that to get faster programs you just had to wait: with each new generation of chip, Moore’s Law and Dennard scaling brought enormous tailwinds in transistor density, power efficiency, and wall-clock performance. But in the aughts and 2010s there was no more sitting and no more waiting. Advances in chip physics were no longer producing the exponential returns they once had, and workload demands continued to grow.
Casting this as mythology, however, obscures the details and risks making the TPU seem like magic. The development of the TPU is a story of trade-offs and constraints and co-design. It touches hardware, software, algorithms, systems, network topology, and everything in between. It did not happen by accident, but through a deliberate process of design and iteration. When thinking about the TPU, it’s natural to ask:
How did we get here?