
Finding a CPU Design Bug in the Xbox 360

The recent reveal of Meltdown and Spectre reminded me of the time I found a related design bug in the Xbox 360 CPU – a newly added instruction whose mere existence was dangerous.

Back in 2005 I was the Xbox 360 CPU guy. I lived and breathed that chip. I still have a 30-cm CPU wafer on my wall, and a four-foot poster of the CPU’s layout. I spent so much time understanding how that CPU’s pipelines worked that when I was asked to investigate some impossible crashes I was able to intuit how a design bug must be their cause. But first, some background…

The Xbox 360 CPU is a three-core PowerPC chip made by IBM. The three cores sit in three separate quadrants, with the fourth quadrant containing a 1-MB L2 cache – you can see the different components in the picture at right and on my CPU wafer. Each core has a 32-KB instruction cache and a 32-KB data cache.

Trivia: Core 0 was closer to the L2 cache and had measurably lower L2 latencies.

The Xbox 360 CPU had high latencies for everything, with memory latencies being particularly bad. And, the 1-MB L2 cache (all that could fit) was pretty small for a three-core CPU. So, conserving space in the L2 cache in order to minimize cache misses was important.

CPU caches improve performance due to spatial and temporal locality. Spatial locality means that if you’ve used one byte of data then you’ll probably use other nearby bytes of data soon. Temporal locality means that if you’ve used some memory then you will probably use it again in the near future.

But sometimes temporal locality doesn’t actually happen. If you are processing a large array of data once-per-frame then it may be trivially provable that it will all be gone from the L2 cache by the time you need it again. You still want that data in the L1 cache so that you can benefit from spatial locality, but having it consuming valuable space in the L2 cache just means it will evict other data, perhaps slowing down the other two cores.

Normally this is unavoidable. The memory coherency mechanism of our PowerPC CPU required that all data in the L1 caches also be in the L2 cache. The MESI protocol used for memory coherency requires that when one core writes to a cache line, any other cores with a copy of the same cache line must discard it – and the L2 cache was responsible for keeping track of which L1 caches were caching which addresses.

But, the CPU was for a video game console and performance trumped all so a new instruction was added – xdcbt. The normal PowerPC dcbt instruction was a typical prefetch instruction. The xdcbt instruction was an extended prefetch instruction that fetched straight from memory to the L1 d-cache, skipping L2. This meant that memory coherency was no longer guaranteed, but hey, we’re video game programmers, we know what we’re doing, it will be fine.

Oops.
