
Visualizing CPU Pipelining (2024)

Why This Matters

This article highlights the importance of understanding CPU pipelining and branch prediction, which are critical for improving processor performance and efficiency. By visualizing and analyzing these low-level mechanisms, the tech industry can develop more optimized CPUs that better handle instruction flow and reduce delays for consumers. Such insights are essential for advancing computing power and energy efficiency in modern hardware.


This is part of my branch prediction series:

1. Visualizing CPU Pipelining
2. Why Branch Prediction Needs Real Data
3. Hacking LLDB to Evaluate Branch Predictions (coming soon)

I want to share what I’ve learned about CPU pipelining. Thanks to Dan Luu’s branch prediction write-up, I was vaguely aware of how this worked conceptually, but I was motivated to dive into the details after reading Rodrigo Copetti’s PlayStation MIPS write-up, where he talked about branch delay slots and how they evolved into branch prediction. I quickly found many subtle and fascinating details on CPU pipelining that I had to share. For more details, I recommend chapter 4 of Computer Organization and Design.

This post will assume you are familiar with the laundry or “assembly-line” model of CPU pipelining, but are hazy on some of the lower-level details. It will also help to have a vague idea of the 5-stage MIPS pipeline (this is a nice, brief high-level summary), since our model CPU will be the 32-bit MIPS.

Visualizing pipelines

Let’s start by visualizing a basic CPU model that does not have pipelining (aka a single-cycle CPU design): one bottleneck is that only one component of the CPU is active at a time; all others are inactive. Pipelined CPUs fill in these vacancies by running instructions through the stages one after the other, rather than waiting for a single instruction to completely finish. This seems pretty natural: it’s like a multi-person assembly line. But there are already some subtle problems that need to be solved.
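The stage-overlap idea can be sketched in a few lines of Python. This is my own illustration, not the article's code: it prints a multiple-clock-cycle diagram for the classic 5-stage pipeline, where instruction i occupies stage s during cycle i + s.

```python
# Sketch: print a pipeline occupancy diagram for the 5-stage MIPS pipeline.
# Once the pipeline is full, every stage is busy each cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    # Total cycles: last instruction enters IF at cycle len-1, exits WB 4 later.
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, instr in enumerate(instructions):
        row = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage          # instruction i is in stage s at cycle i+s
        rows.append((instr, row))
    return rows

for instr, row in pipeline_diagram(["add", "sub", "and", "or"]):
    print(f"{instr:4} " + " ".join(f"{c:>3}" for c in row))
```

With four instructions the diagram is 8 cycles wide; in a single-cycle design the same work would take 20 stage-lengths of time.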

Instruction Decoding

Instruction decoding orchestrates the entire pipeline by providing fields used by the other stages. Let’s work through a simplified example. Say the instruction is

    add $t1, $s2, $s3

The IF stage fetches the instruction and puts it into the ID stage register. Now the ID register contains fields that will be used by all the remaining stages:

    Instruction word: 0x02534820

    Field   op      rs     rt     rd     shamt  funct
    Binary  000000  10010  10011  01001  00000  100000
            R-type  $s2    $s3    $t1           add

- EX will use op, rs, and rt for the ALU operation and inputs
- MEM (for lw and sw) uses rt
- WB will write to register rd

With pipelining, we have a problem. For example, in cycle 3 of the previous visualization, add moves into EX and sub moves into ID. But WB needs to know the rd of the add instruction! That field used to be safely stored in the ID register, but sub overwrote it. The solution is to have registers between each pipeline stage that carry fields forward from the ID stage (as well as from other stages). Now when add reaches WB, it has the rd field, so it can write to the correct register. The actual value from EX or MEM is stored in the MEM/WB register as well.
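Field extraction in the ID stage is just shifts and masks over the 32-bit word. A minimal Python sketch (my own, not the article's code), assuming the word 0x02534820 encodes add $t1, $s2, $s3 with $s2 = register 18, $s3 = 19, $t1 = 9:

```python
# Slice a MIPS R-type instruction word into its six fields.
def decode_rtype(word):
    return {
        "op":    (word >> 26) & 0x3F,  # 6 bits: opcode (0 for R-type)
        "rs":    (word >> 21) & 0x1F,  # 5 bits: first source register
        "rt":    (word >> 16) & 0x1F,  # 5 bits: second source register
        "rd":    (word >> 11) & 0x1F,  # 5 bits: destination register
        "shamt": (word >> 6)  & 0x1F,  # 5 bits: shift amount
        "funct": word & 0x3F,          # 6 bits: ALU function (0b100000 = add)
    }

fields = decode_rtype(0x02534820)
print(fields)  # rs=18 ($s2), rt=19 ($s3), rd=9 ($t1), funct=32 (add)
```

It's exactly these decoded fields that the ID/EX, EX/MEM, and MEM/WB registers carry along with each instruction.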

Hazard detection

The field metadata from ID needs to be available at each stage to enable basic operations like writing back to the registers, but the fields also turn out to be crucial for solving data hazards. To start, let’s look at this example: sub has a dependency on the output of add. This is called a data hazard. Structurally, the CPU could proceed, but the result would be incorrect.

We need a way to check whether any of the inputs of the instruction in ID match the output register of any downstream instruction. Since the field metadata is propagated to each stage register, we have all the data we need! What’s missing is a unit to calculate this:

    ID/EX.rs == EX/MEM.rd || ID/EX.rt == EX/MEM.rd ||
    ID/EX.rs == MEM/WB.rd || ID/EX.rt == MEM/WB.rd

That’s the Hazard Detection Unit (HDU). This unit both detects hazards and prevents progress until the hazard is resolved. This hazard detection logic works for “R-type” instructions that write to registers (like add and and), but not every instruction writes to registers (sw, for example). We’d technically need to check a few other conditions for a complete implementation, especially around data hazards involving lw and sw.

The solid lines show the inputs to the HDU, detecting the hazard. The dotted lines are the control signals, stopping the PC and IF and writing a nop instruction to the ID/EX register. The nop proceeds through each stage, eventually causing the hazard comparison to evaluate false. This lifts the stall on the PC and IF stages and removes the nop-injection signal, allowing the sub instruction to continue.

Why are stalls also called bubbles? Because once they appear, they flow through the pipeline until they pop out of the end. Like a diver who lets out air underwater, the bubble rises to the surface. Stretching the analogy further, the diver, like the HDU, can let out more than one bubble depending on the circumstance. The previous example let out 2 bubbles.
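The HDU comparison can be sketched in Python. This is my own illustration under assumed names: reg_write stands in for the control signal that marks instructions that will write a register, which is how sw avoids triggering false stalls.

```python
# Stall if either source register of the instruction in ID/EX matches the
# destination of an older instruction still in EX/MEM or MEM/WB.
def hazard(id_ex, ex_mem, mem_wb):
    for older in (ex_mem, mem_wb):
        if older["reg_write"] and older["rd"] in (id_ex["rs"], id_ex["rt"]):
            return True   # dependency on an in-flight result: insert a bubble
    return False

# add $t1, $s2, $s3 (now in EX/MEM) followed by sub reading $t1 (reg 9):
ex_mem = {"rd": 9, "reg_write": True}
mem_wb = {"rd": 0, "reg_write": False}
id_ex  = {"rs": 9, "rt": 4}
print(hazard(id_ex, ex_mem, mem_wb))  # True -> stall
```

As long as this returns True, the HDU keeps the PC and IF frozen and injects a nop into ID/EX.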

Forwarding

There’s another way we can use the hazard detection logic on the register metadata to resolve certain types of hazards. Instead of stalling, we can “forward” an intermediate result to the next instruction.

Aside: Even though the data is moving backward in the pipeline, it’s called forwarding because in a multi-clock pipelining diagram the dependencies can be drawn “forward” in time. See page 364 of Computer Organization and Design, 4th ed., for more discussion.

Let’s continue with our add and sub example. The between-stage registers hold not only metadata but also the results of the prior stage. If we detect a hazard, we can use that result immediately instead of waiting for the instruction to write back to the register and exit the pipeline. The forwarding unit (FU) uses the same comparison logic, and if it detects a hazard, the control sub-unit ensures the EX result is used instead of the input register data (r3 in this case). For this particular case, the hazard resolves with zero stalls! (See figure 4.60 in Computer Organization and Design, 4th ed., for the full picture; this is a bit of a simplification.)
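The FU's job amounts to a mux select per ALU input. A hedged sketch under the same assumed field names (not the article's code): EX/MEM is checked first because it holds the newer result when both pipeline registers match.

```python
# Decide where each ALU source operand should come from.
def forward_select(src_reg, ex_mem, mem_wb):
    if ex_mem["reg_write"] and ex_mem["rd"] == src_reg:
        return "EX/MEM"   # freshest: ALU result from the previous cycle
    if mem_wb["reg_write"] and mem_wb["rd"] == src_reg:
        return "MEM/WB"   # result about to be written back
    return "REGFILE"      # no hazard: read the register file normally

ex_mem = {"rd": 9, "reg_write": True}
mem_wb = {"rd": 9, "reg_write": True}
print(forward_select(9, ex_mem, mem_wb))  # EX/MEM: newer write wins
print(forward_select(4, ex_mem, mem_wb))  # REGFILE
```

The priority ordering matters: with two in-flight writes to the same register, forwarding the older MEM/WB value would silently compute the wrong answer.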

HDU and FU

In the FU diagram, I did not add the comparison between the ID/EX input registers and the MEM/WB output registers that I had for the HDU. The problem is that forwarding can’t handle this case alone. Let’s look at this example:

    lw  r1, 0(r2)
    add r3, r1, r4

Our HDU handles this situation just fine: in fact, it’s identical to our previous example, where we must stall until WB writes back to register r1, creating 2 bubbles. But there’s an opportunity here: the intermediate result is available in the MEM/WB register before it’s written to r1, so we could forward it to add’s EX stage. To do this, the FU and HDU have to work in conjunction, saving 1 bubble.

I’m not going to go into all the details of how the HDU and FU work together, because it’s a little different from how I’ve laid it out here. And as I hinted at with this example, there are nuances around lw and sw instructions you have to be careful with.
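The load-use case can be sketched as its own check. Field names like mem_read are my assumption, standing in for the control signal that marks loads; the point is that when a lw in EX feeds the very next instruction, one stall is unavoidable (the value only exists after MEM), and forwarding then covers the rest.

```python
# One-cycle stall when the instruction in ID/EX is a load whose destination
# (rt for lw) matches a source of the instruction right behind it in IF/ID.
def load_use_stall(id_ex, if_id):
    return (id_ex["mem_read"]
            and id_ex["rt"] in (if_id["rs"], if_id["rt"]))

# lw r1, 0(r2) ; add r3, r1, r4
id_ex = {"mem_read": True, "rt": 1}   # lw will write r1 after MEM
if_id = {"rs": 1, "rt": 4}            # add reads r1
print(load_use_stall(id_ex, if_id))   # True: stall once, then forward
```

This is why the combined HDU + FU scheme costs 1 bubble here instead of the 2 a stall-only HDU would insert.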
