
Branch prediction: Why CPUs can't wait?


Recently, I have had some free time and started learning some low-level computer fundamentals, trying to practice the concepts and understand them in greater detail. Along the way, I learned about LLVM, a modern compiler infrastructure that provides a target-independent optimizer and code generator, and that is used as part of many compilers for programming languages like Rust, Zig, and Haskell. While diving into LLVM’s aggressive optimization features, I had a humbling realization: even the most advanced compiler optimizations cannot save you from a fundamental limitation at the hardware level. If a program has unpredictable branches, all of LLVM’s clever optimizations become secondary to something called “branch prediction”. Poor branch patterns can easily negate the benefits of many LLVM optimizations.

There are some latency numbers every programmer should know, according to Jeff Dean, and one of them is a branch misprediction, which cost around 5 ns in 2012 and remains roughly the same at the time of writing this post. So what is branch prediction, what happens when a branch is mispredicted, and why is it costly? Before diving into the explanation, it’s worth looking at the instruction execution pattern the CPU ideally expects. Memory layout is linear, so when a program is compiled and stored in memory, its instructions are stored as a contiguous sequence:

Memory Address | Instruction        | PC Value
0x1000         | MOV EAX, 5         | PC = 0x1000
0x1003         | ADD EAX, 10        | PC = 0x1003
0x1006         | MOV [0x2000], EAX  | PC = 0x1006
0x100A         | RET                | PC = 0x100A

The natural way to read this is from top to bottom, just like reading a book. In a CPU, there is a special register named Program Counter (PC), and its job is to record the address of the current instruction. The CPU’s program counter naturally increments to the next sequential instruction address:

[math] PC = PC + instruction\ size[/math]

The instruction size is the size of the current instruction; for example, after executing MOV EAX, 5 (3 bytes), the CPU automatically sets:

[math] PC\ =\ 0x1000\ +\ 3\ =\ 0x1003[/math]

So the CPU can find the address of the next instruction and execute it. One important detail, though: completing an instruction involves multiple internal stages that every instruction goes through, which raises a question: does the CPU simply wait for all the stages of one instruction to finish before starting the next? To answer this, let’s first look at the stages a typical processor performs for each instruction:

- Instruction Fetch (IF): Get the instruction from memory
- Instruction Decode (ID): Figure out what it means
- Execute (EX): Perform the operation
- Memory Access (MEM): Access memory if needed
- Write Back (WB): Store the result in a register

Modern CPUs employ a technique called instruction pipelining: even though one instruction hasn’t finished all of its stages, the next instruction can be started immediately, the idea being to exploit all of the hardware the CPU has. Each stage is handled by a different hardware unit, for example a fetch unit, a decode unit, and an execute unit. While the execute unit is working on the Execute stage of the current instruction, the fetch unit can be working on the Fetch stage of the next instruction in parallel.
