
Talos: Hardware accelerator for deep convolutional neural networks


Talos went through a series of architecture evolutions. Working with low-level digital design and FPGAs isn't just about getting the math right. It is about getting the math right within the hard physical constraints of the DE1-SoC. The FPGA has a fixed number of logic array blocks (LABs), a fixed amount of on-chip memory, and a fixed routing fabric. You can't negotiate with it. Every architectural tweak was forced by those limits.

The First Attempt: Brute Force

Our first attempt was a not-so-genius brute force: running all four cnn and maxpool instances, one for each kernel, simultaneously in parallel. Logically, this is the fastest possible approach. In practice, however, it blew up the DE1-SoC, consuming nearly 4× the available LABs on the chip and making the design too big for the fitter to physically route. We also initially had 10 instances of a neuron module with a massive port connecting directly to the maxpool outputs. The sheer width of that bus created severe routing congestion, and Quartus threw fitter errors before we even got to timing analysis. The design was simply too big to put on a chip as small as the Cyclone V.

The beauty of constraints is that they force you to think: about why something doesn't work, and whether the approach itself is wrong. In software, you can often brute force your way through and worry about optimizing later. In hardware, however, if it doesn't fit, it doesn't ship.

The Pivot: Time vs Memory

Hardware forces you to choose: a design is either insanely fast or eats a whole lot of circuitry. The tradeoff between speed and area is worth noting. No matter how fast a design is, if it doesn't fit on the chip, it's useless. Keeping the overall memory footprint in mind while squeezing every cycle out at the module level, we decided on a time-multiplexed architecture. Instead of four parallel instances, we used only one cnn module and one maxpool module, and ran them consecutively four times, once for each kernel. This is the architecture Talos ships with.
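The tradeoff can be put in back-of-envelope terms. The numbers below are normalized illustrations (assumptions, not measured Talos figures): one cnn + maxpool pair is taken to cost 1 unit of logic and 1 unit of latency per pass, with four kernels to process.

```python
# Normalized speed-vs-area comparison (illustrative assumptions,
# not measured Talos resource counts).
NUM_KERNELS = 4

# Fully parallel: four cnn+maxpool instances, all passes overlap.
parallel_area = NUM_KERNELS * 1.0
parallel_latency = 1.0

# Time-multiplexed: one instance reused sequentially, one pass per kernel.
multiplexed_area = 1.0
multiplexed_latency = NUM_KERNELS * 1.0
```

Under these assumptions, time-multiplexing buys a 4× reduction in convolution/pool logic at the cost of 4× the latency, which is exactly the trade that makes an otherwise unroutable design fit.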

This is handled by a finite state machine in the inference module that cycles through the following states:

Inference FSM, time-multiplexed architecture (one pass per kernel, pass 1/4 through 4/4):

S_IDLE: cnn_en=0, mp_en=0, complete=0; waits for enable
S_CLEAR: clear_accum ← 1; ker_sel ← 0
S_CNN: cnn_en ← 1; kernel = ker_bus[ker_sel]; advances on cnn_complete
S_POOL: mp_en ← 1; pass_sel = ker_sel; advances on mp_complete
S_GAP: cnn_en ← 0; mp_en ← 0; if ker_sel < 3, ker_sel ← ker_sel + 1 and return to S_CNN; if ker_sel == 3, go to S_DONE
S_DONE: complete ← 1; neurons[0:9] output in Q16.16

State transitions for the time-multiplexed inference control.

It starts by setting clear_accum high in S_CLEAR to reset all 10 neuron accumulators. Then, for each pass, the state first changes to S_CNN, which sets cnn_en high; this starts the cnn module and runs the convolution with the kernel selected by ker_sel. Once cnn_complete goes high, indicating that all kernel operations have finished, the state moves to S_POOL, which sets mp_en high and runs the maxpool module for that pass. After mp_complete goes high, it hits S_GAP, increments ker_sel to move to the next kernel, resets the internal buses, and loops back to S_CNN. The neuron accumulators are never cleared between passes, so they keep accumulating across all four runs, which is exactly how the weighted sum across the 676 inputs of the fully connected layer is supposed to work. Once ker_sel hits 3, indicating all four kernels have been processed, and the final pass completes, the state goes to S_DONE and sets complete high.
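The schedule above can be sketched as a small software model. State and signal names (ker_sel, clear_accum, cnn_complete, mp_complete) follow the article; the per-pass contribution is a stub, since the actual convolution and maxpool arithmetic live in the cnn and maxpool modules, not here.

```python
# Hypothetical software model of the inference FSM described above.
# The conv/pool work is stubbed out; only the state sequencing and the
# cross-pass accumulation behavior are modeled (assumptions, not RTL).

S_IDLE, S_CLEAR, S_CNN, S_POOL, S_GAP, S_DONE = range(6)

def run_inference(num_kernels=4, num_neurons=10):
    # Stand-in for one cnn + maxpool pass: returns that kernel's
    # partial contribution to the 10 neuron accumulators.
    def pass_contribution(ker_sel):
        return [ker_sel + 1] * num_neurons

    state = S_CLEAR          # enable received, leaving S_IDLE
    ker_sel = 0
    neurons = None
    trace = []               # record of visited states, for inspection

    while state != S_DONE:
        trace.append(state)
        if state == S_CLEAR:
            neurons = [0] * num_neurons   # clear_accum: reset accumulators
            ker_sel = 0
            state = S_CNN
        elif state == S_CNN:
            # cnn_en high: convolve with kernel ker_sel (stubbed)
            partial = pass_contribution(ker_sel)
            state = S_POOL                # advance on cnn_complete
        elif state == S_POOL:
            # mp_en high: maxpool feeds the neurons; accumulators are
            # NOT cleared between passes, so partial sums pile up.
            neurons = [n + p for n, p in zip(neurons, partial)]
            state = S_GAP                 # advance on mp_complete
        elif state == S_GAP:
            # cnn_en, mp_en deasserted; pick next kernel or finish.
            if ker_sel == num_kernels - 1:
                state = S_DONE            # final pass complete
            else:
                ker_sel += 1
                state = S_CNN
    trace.append(S_DONE)                  # complete goes high
    return neurons, trace
```

With the stubbed contributions 1 through 4, each neuron ends at 1 + 2 + 3 + 4 = 10, illustrating how the four sequential passes build the same weighted sum a parallel design would have produced at once.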

This approach alone cut the LAB (Logic Array Block) footprint to almost half that of the initial design, showing we were indeed on the right path.
