80386 Early Start Memory Access

When Intel designed the 80386, they gave it a trick for hiding memory latency: Early Start. Instead of waiting for an instruction to reach its memory micro-op, the 386 begins the next instruction's address work — effective address, segment relocation, the bus cycle — in the last cycle of the current instruction. Intel put it at about 9% of overall performance. It is also the source of the POPAD bug.
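The overlap can be illustrated with a toy cycle-count model. This is only a sketch: the one-cycle address cost, the execute-cycle counts, and the names `ADDRESS_CYCLES` and `total_cycles` are assumptions made for illustration, not measured 80386 timings.

```python
# Toy model: each instruction is (execute_cycles, uses_memory).
# ADDRESS_CYCLES is a hypothetical cost for address generation.
ADDRESS_CYCLES = 1


def total_cycles(program, early_start):
    """Count cycles for a straight-line program under the toy model."""
    total = 0
    for i, (execute_cycles, uses_memory) in enumerate(program):
        total += execute_cycles
        if uses_memory:
            # With Early Start, address work overlaps the previous
            # instruction's last cycle, so it costs nothing extra here.
            # The first instruction has no predecessor to overlap with.
            overlapped = early_start and i > 0
            if not overlapped:
                total += ADDRESS_CYCLES
    return total


# Hypothetical four-instruction sequence, three of which touch memory.
program = [(2, True), (2, False), (2, True), (2, True)]
print(total_cycles(program, early_start=False))
print(total_cycles(program, early_start=True))
```

In this model the saving is one cycle per memory instruction that follows another instruction; the real gain depends on the instruction mix.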

The z386 FPGA core I released in May ran the original 386 microcode but didn't have Early Start. Over the last month I added it along with a series of other optimizations, and z386 now reaches ao486-class performance:

core              Doom (FPS)   3DBench   Landmark
z386 0.1 (May)    16.6         33.7      147
z386 0.4 (June)   23.0         44.5      170
ao486             21.0         43.8      204

Doom (original, max details) went up ~39% (16.6 → 23.0), past ao486's 21.0, and the 16-bit 3DBench now edges past ao486 too (44.5 vs 43.8). The board clock is unchanged from v0.1's 85 MHz, so the gains came entirely from cutting CPI (cycles per instruction), that is, doing more work per clock. Per instruction, z386 went from well above the 386's cycle counts to at or below them on nearly everything:

Instruction timings: z386 0.1 → 0.4 vs the original 80386.
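The benchmark gains above can be checked with a few lines of arithmetic. The sketch below uses only the scores from the table; the function names are illustrative, and the CPI estimate assumes the benchmark is CPU-bound with the same instruction count on both versions, which the post does not state.

```python
def speedup_percent(before, after):
    """Percentage gain going from `before` to `after`."""
    return (after / before - 1.0) * 100.0


def implied_cpi_ratio(before, after):
    """New CPI divided by old CPI at a fixed clock.

    Assumption: throughput scales as 1 / CPI, i.e. the workload is
    CPU-bound and executes the same instructions in both versions.
    """
    return before / after


# Scores from the table: (z386 0.1, z386 0.4)
scores = {
    "Doom (FPS)": (16.6, 23.0),
    "3DBench": (33.7, 44.5),
    "Landmark": (147, 170),
}

for name, (old, new) in scores.items():
    print(f"{name}: +{speedup_percent(old, new):.1f}%, "
          f"CPI ratio {implied_cpi_ratio(old, new):.2f}")
```

For Doom this gives a gain of roughly 39% and, under the stated assumption, a CPI around 0.72 of the v0.1 value.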

The memory pipeline post earlier in this series introduced Early Start as a concept. This post is about building it on an FPGA, plus the rest of the CPI work that got z386 to parity.

Early Start

Intel discussed Early Start in Slager's ICCD '86 paper, "Performance Optimizations of the 80386". The clue to how it works is in the microcode. Here is the entry for an ALU instruction that reads a memory operand (ADD reg, [mem]):

; ADD/OR/ADC/SBB/AND/SUB/XOR m,r
04A  EFLAGS -> FLAGSB FLGSBA RD 9
04B  DLY
04C  OPR_R -> TMPB WRITE_RESULT JMP UNL
04D  TMPB SRCREG +-&|^

The interesting thing is that the first micro-instruction, 04A, already issues RD: it starts the memory read. No micro-instruction before it computes the effective address, adds the segment base, or checks the limit. Address generation is implicit, done by hardwired logic. A concrete example makes this clearer:
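The three hardwired steps named above (effective address, segment base, limit check) can be sketched in software. This is a minimal model, not the z386 or 80386 logic: the `Segment` class, the function names, and the use of a Python exception for a limit violation are assumptions, and expand-down segments, paging, and access-rights checks are left out.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


@dataclass
class Segment:
    base: int   # linear base address from the cached descriptor
    limit: int  # highest valid offset, inclusive


class SegmentLimitError(Exception):
    """Stands in for the fault raised on a limit violation."""


def effective_address(base=0, index=0, scale=1, disp=0):
    """Offset within the segment: base + index * scale + displacement."""
    return (base + index * scale + disp) & MASK32


def linear_address(seg, offset, size):
    """Check the segment limit, then relocate the offset by the base."""
    # The last byte of the access must not exceed the limit.
    if offset + size - 1 > seg.limit:
        raise SegmentLimitError(hex(offset))
    return (seg.base + offset) & MASK32


# Hypothetical operand: 4-byte access at [base + index*4 + 8].
ds = Segment(base=0x10000, limit=0xFFFF)
ea = effective_address(base=0x1000, index=4, scale=4, disp=8)
print(hex(ea), hex(linear_address(ds, ea, size=4)))
```

In hardware these steps are done by dedicated logic rather than in sequence as here, which is why no micro-instruction needs to request them.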
