Show HN: I built a toy TPU that can do inference and training on the XOR problem

Nobody really understands how TPUs work…and neither do we! So we built this to take a shot at guessing how they work, from the perspective of complete novices.

We wanted to do something very challenging to prove to ourselves that we can do anything we put our minds to. Our reason for choosing a TPU specifically is fairly simple:

None of us has real professional experience in hardware design, which, in a way, made the TPU even more appealing, since we couldn't estimate exactly how difficult it would be. In the initial stages of this project, we established a strict design philosophy: ALWAYS TRY THE HACKY WAY. This meant trying out the "dumb" ideas that came to mind first, BEFORE consulting external sources. This philosophy made sure we weren't reverse engineering the TPU but re-inventing it, and it led us to derive many of the key mechanisms used in the TPU ourselves.

We also wanted to treat this project as an exercise in writing code without relying on AI to write it for us, since we felt that our first instinct recently has been to reach for these AI tools whenever we faced a slight struggle. We wanted to cultivate a style of thinking that we could carry into future endeavours and use to think through difficult problems.[1]

Before we move forward, we want to make it clear what this article covers and what it doesn't. Note that this is NOT a 1-to-1 replica of the TPU; it is our attempt at re-inventing the TPU ourselves.

Throughout this project we tried to learn as much as we could about the fundamentals of deep learning, hardware design and creating algorithms. We found that the best way to learn about this stuff is by drawing everything out and making that our first instinct. As you read this post, you will see how our explanations were inspired by this philosophy.

Specifically, the TPU is very efficient at performing matrix multiplications, which make up 80-90% of the compute operations in transformers (up to 95% in very large models) and 70-80% in CNNs. Each matrix multiplication represents the calculation for a single layer in an MLP, and in deep learning, we have many of these layers, making TPUs increasingly efficient for larger models.
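To make the link between a layer and a matrix multiplication concrete, here is a worked sketch for a hypothetical layer with 2 inputs and 2 neurons, evaluated on the four XOR inputs at once. The symbols X (inputs), W (weights), b (biases), f (activation function) and H (layer output) are notation assumed for illustration, not taken from the design itself:

```latex
X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix},
\quad
W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix},
\quad
XW = \begin{bmatrix}
0 & 0 \\
w_{21} & w_{22} \\
w_{11} & w_{12} \\
w_{11} + w_{21} & w_{12} + w_{22}
\end{bmatrix}

H = f\left(XW + \mathbf{1}\, b^{\top}\right)
```

Each row of XW is the pre-activation of both neurons for one input pair, so a single matrix multiplication covers the whole layer for the whole batch.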

In the example above, the value of the signal b at the next clock cycle is set to the current value of the signal a. You'll find that in most cases, signals (variables) are updated in sequential clock cycles, as opposed to immediate updates like you would find in software design.
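The clocked update described here can be sketched in Verilog roughly as follows. This is a minimal illustration, assuming a clock named clk and 1-bit signals a and b; the module name and port names are hypothetical:

```verilog
// Hypothetical module: copies a into b on every rising clock edge.
module register_example (
    input  wire clk,  // clock
    input  wire a,    // value sampled at the clock edge
    output reg  b     // holds the sampled value until the next edge
);
    always @(posedge clk) begin
        // Non-blocking assignment: b changes at the clock edge,
        // not immediately when a changes.
        b <= a;
    end
endmodule
```

Between clock edges, b keeps its previous value even if a changes, which is the main difference from a plain variable assignment in software.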

The language we use to describe hardware is called Verilog. It's a hardware description language that lets us describe the behaviour of a given hardware module (similar to a function in software), but instead of executing as a program, it synthesizes into Boolean logic gates (AND, OR, NOT, etc.) that can be combined to build the digital logic for any chip we want. Here's a simple example of an addition in Verilog:
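A minimal sketch of such an adder, assuming 8-bit unsigned inputs. The module name, port names and bit widths are illustrative assumptions rather than the original snippet:

```verilog
// Hypothetical combinational adder: no clock, the output follows the inputs.
module adder (
    input  wire [7:0] a,    // first operand
    input  wire [7:0] b,    // second operand
    output wire [8:0] sum   // one extra bit so the carry is not lost
);
    // Continuous assignment: a synthesis tool turns this into adder logic.
    assign sum = a + b;
endmodule
```

Because sum is 9 bits wide, the addition is evaluated at 9 bits and the carry out of the top bit is kept instead of being truncated.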

Quick primer on hardware design: In hardware, the unit of time we're dealing with is called a clock cycle. This is an arbitrary period of time that we can set, as developers, to meet our requirements. Generally, a single clock cycle can range from 1 picosecond (ps) to 1 nanosecond (ns) and any operations we run will be executed BETWEEN clock cycles.
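In simulation, choosing the clock period can look like the sketch below. It is a testbench-style example for simulation only (not synthesizable), assuming a 1 ns period; the module name and the HALF_PERIOD parameter are hypothetical:

```verilog
`timescale 1ns / 1ps  // time unit = 1 ns, precision = 1 ps

module clock_example;
    // Assumed period for illustration: 1 ns (0.5 ns low, 0.5 ns high).
    localparam real HALF_PERIOD = 0.5;

    reg clk = 1'b0;

    // Toggle the clock every half period.
    always #(HALF_PERIOD) clk = ~clk;

    initial begin
        #10;      // let the simulation run for 10 ns, i.e. 10 clock cycles
        $finish;
    end
endmodule
```

On real hardware the period is set by an oscillator or a clock generator, and it has to be long enough for the slowest logic path between two registers to settle.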
