Today, AMD’s Infinity Fabric interconnect is ubiquitous across the company’s lineup. Infinity Fabric provides well-defined interfaces to a transport layer, and lets different functional blocks treat the interconnect as a black box. The system worked well enough to let AMD create integrated GPU products all the way from the Steam Deck’s tiny Van Gogh APU to giant systems packing four MI300A chips. Across all those offerings, Infinity Fabric enables coherent memory access as CPU and GPU requests converge on Coherent Slaves (CSes), which have a probe filter and can request data from the appropriate source.
AMD was also trying to build powerful iGPUs over a decade ago, but their interconnect looked very different at the time. Their Northbridge architecture owes its name to the Athlon 64 era, when AMD brought the chipset northbridge’s functionality onto the CPU die. AMD’s engineers likely didn’t envision needing to tightly integrate a GPU during Athlon 64’s development, so the interconnect was designed to tie together a few CPU cores and service memory requests with low latency. But then AMD acquired ATI, and set about trying to take advantage of their new graphics talent by building powerful iGPUs.
“Trinity” is an AMD Accelerated Processing Unit (APU) from 2012. It combines two dual-threaded Piledriver CPU modules with a 6-SIMD Terascale 3 iGPU. Here I’ll be looking into AMD’s second-generation iGPU interconnect with the A8-5600K, a slightly cut-down Trinity implementation with four Terascale 3 SIMDs enabled and CPU boost clocks dropped from 4.2 to 3.9 GHz. I’m testing the chip on an MSI FM2-A75MA-E35 board with 16 GB of Kingston DDR3-1866 10-10-9-26 memory.
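Latency figures in this kind of testing typically come from simple microbenchmarks. As a rough illustration of the methodology only (this is a generic sketch, not the exact harness used for this article), a pointer-chasing latency test walks a randomized pattern that the hardware prefetchers can’t follow:

```c
// chase.c - generic pointer-chasing latency sketch. Build with: gcc -O2 chase.c
// Region size and iteration count are illustrative choices, not the values used here.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main(void) {
    size_t region_bytes = 512 * 1024 * 1024;        // large enough to miss every cache level
    size_t count = region_bytes / sizeof(uint64_t);
    uint64_t *arr = malloc(region_bytes);
    if (!arr) return 1;

    // Build a single-cycle random permutation (Sattolo's algorithm) so each hop
    // lands somewhere the prefetchers can't predict.
    for (size_t i = 0; i < count; i++) arr[i] = i;
    srand(42);
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = rand() % i;                       // j < i keeps the permutation one big cycle
        uint64_t tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
    }

    uint64_t iterations = 20000000;                  // enough hops to average out noise
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t pos = 0;
    for (uint64_t it = 0; it < iterations; it++) pos = arr[pos];   // serialized dependent loads
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    // Print pos so the compiler can't optimize the chase away.
    printf("avg latency: %.2f ns (dummy %llu)\n", ns / iterations, (unsigned long long)pos);
    free(arr);
    return 0;
}
```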
Trinity’s on-die network resembles that of AMD’s first APU, Llano, and has clear similarities to AMD’s Athlon 64 Northbridge. The Northbridge sits on a separate voltage and frequency domain, and runs at 1.8 GHz on the A8-5600K. It uses a two-level crossbar setup. CPU cores connect to a System Request Interface (SRI), which routes requests onto a set of queues. Most memory requests head to a System Request Queue (SRQ). The SRI also accepts incoming probes on a separate queue and routes them to the CPU cores. A second-level crossbar, simply called the XBAR, connects with the SRI’s various queues and routes requests to IO and memory. The XBAR can handle IO-to-IO communication too, though such traffic is rare on consumer systems.
On Trinity, the XBAR’s scheduler (XCS) has 40 entries, making it slightly smaller than the 64 entry XCS on desktop and server Piledriver chips. AMD defaults to a 22+10+8 entry split between the SRI, Memory Controller (MCT), and upstream channel, though the BIOS can opt for a different XCS entry allocation. The XBAR sends memory requests to the MCT, which prioritizes requests based on type and age, and includes a strided-access prefetcher. Like Infinity Fabric’s CS, the MCT is responsible for ensuring cache coherency and can send probes back to the XBAR. The MCT also translates physical addresses to “normalized” addresses that only cover the memory space backed by DRAM.
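To make that normalization step concrete, here’s a toy model of what stripping non-DRAM-backed ranges out of the physical address space could look like. The hole placement and the logic are illustrative assumptions, not Trinity’s actual mapping:

```c
// Toy model of physical-to-"normalized" address translation: ranges not backed by
// DRAM (e.g. an MMIO hole below 4GB) are removed so the DRAM controllers see a
// dense address space. Hole layout here is made up for illustration.
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { uint64_t base, size; } hole_t;

static const hole_t holes[] = {
    { 0xC0000000ull, 0x40000000ull },   // hypothetical 1 GB MMIO hole just below 4 GB
};

bool normalize(uint64_t phys, uint64_t *norm) {
    uint64_t removed = 0;
    for (size_t i = 0; i < sizeof(holes) / sizeof(holes[0]); i++) {
        if (phys >= holes[i].base && phys < holes[i].base + holes[i].size)
            return false;                   // address isn't DRAM-backed at all
        if (phys >= holes[i].base + holes[i].size)
            removed += holes[i].size;       // this hole sits below the address, so subtract it
    }
    *norm = phys - removed;                 // dense address handed to the DRAM controllers
    return true;
}

int main(void) {
    uint64_t n;
    if (normalize(0x100000000ull, &n))      // first address above the 4 GB boundary
        printf("phys 0x100000000 -> normalized 0x%llx\n", (unsigned long long)n);
    return 0;
}
```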
All AMD CPUs from Trinity’s era use a similar SRI+XBAR setup, but Trinity’s iGPU sets it apart from CPU-only products. The iGPU gets its own Graphics Memory Controller (GMC), which arbitrates between different request types and schedules requests to maximize DRAM bandwidth utilization. Thus the GMC performs an analogous role to the MCT on the CPU side. A “Radeon Memory Bus” connects the GMC to the DRAM controllers, bypassing the MCT. AMD documentation occasionally refers to the Radeon Memory Bus as “Garlic”, likely a holdover from internal names used during Llano’s development.
A second control link hooks the GPU into the XBAR, much like any other IO device. Previously, the control link was called the “Fusion Control Link”, or “Onion”. I’ll use “Garlic” and “Onion” to refer to the two links because those names are shorter.
The “Garlic” Link (Radeon Memory Bus)
Trinity’s “Garlic” link lets the GPU saturate DRAM bandwidth. It bypasses the MCT and therefore bypasses the Northbridge’s cache coherency mechanisms. That’s a key design feature, because CPUs and GPUs usually don’t share data. Snooping the CPU caches for GPU memory requests would waste power and flood the interconnect with probes, most of which would miss.
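As a rough sketch of how “Garlic” bandwidth can be exercised, an OpenCL program can stream a large buffer through the iGPU and time it. The code below is a generic illustration with made-up buffer sizes and work-item counts, not the test actually used for this article; a real test would warm up and average several runs.

```c
// readbw.c - hedged sketch of an iGPU read-bandwidth test. Build: gcc readbw.c -lOpenCL
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdio.h>
#include <time.h>

// Each work-item strides through the buffer and accumulates, so reads can't be optimized out.
static const char *src =
    "__kernel void readbw(__global const float4 *in, __global float *out, uint len) {\n"
    "    float4 acc = (float4)(0.0f);\n"
    "    for (uint i = get_global_id(0); i < len; i += get_global_size(0))\n"
    "        acc += in[i];\n"
    "    out[get_global_id(0)] = acc.x + acc.y + acc.z + acc.w;\n"
    "}\n";

int main(void) {
    size_t bytes = 512u * 1024 * 1024;           // large enough to spill out of the iGPU's caches
    cl_uint len = (cl_uint)(bytes / sizeof(cl_float4));
    size_t global = 16384;                       // plenty of work-items to keep the SIMDs busy

    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    // Contents don't matter for a bandwidth test, so the input buffer is left uninitialized.
    cl_mem in = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, global * sizeof(cl_float), NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "readbw", NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &out);
    clSetKernelArg(k, 2, sizeof(cl_uint), &len);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(q);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("read bandwidth: %.2f GB/s\n", bytes / s / 1e9);
    return 0;
}
```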
Bypassing the MCT also means skipping the Northbridge’s regular memory request prioritization, so the MCT and GMC have various mechanisms to keep the CPU and GPU from starving each other of memory bandwidth. First, the MCT and GMC can limit how many outstanding requests they have in the DRAM controller queues (DCQs). Then, the DCQs can alternate between accepting requests from the MCT and GMC, or can be set to prioritize the requester with fewer outstanding requests. Trinity defaults to the former. Finally, because the CPU tends to be more latency sensitive, Trinity can prioritize MCT-side reads ahead of GMC requests. Some of this arbitration seems to happen in queues in front of the DCQs. AMD’s BIOS and Kernel Developer’s Guide (BKDG) indicates there’s a 4-bit read pointer for a sideband signal FIFO between the GMC and DCT, so the “Garlic” link may have a queue with up to 16 entries.
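A toy model helps illustrate the difference between those two DCQ arbitration policies. Everything below, including the structure and numbers, is an illustrative assumption rather than Trinity’s actual arbitration logic:

```c
// Toy model of two DCQ arbitration policies: strictly alternating between the MCT
// and GMC (Trinity's default), or favoring whichever side has fewer requests
// already outstanding in the DCQ.
#include <stdio.h>
#include <stdbool.h>

typedef struct { int waiting; int outstanding; } requester_t;

// Pick which side the DCQ accepts from next. 0 = MCT, 1 = GMC.
int arbitrate(const requester_t *mct, const requester_t *gmc, bool alternate, int last) {
    if (mct->waiting == 0) return 1;                      // only one side has work
    if (gmc->waiting == 0) return 0;
    if (alternate)
        return last == 0 ? 1 : 0;                         // take turns
    return mct->outstanding <= gmc->outstanding ? 0 : 1;  // favor the lighter requester
}

int main(void) {
    requester_t mct = { .waiting = 8,  .outstanding = 2  };
    requester_t gmc = { .waiting = 32, .outstanding = 10 };
    int last = 1;
    for (int cycle = 0; cycle < 6; cycle++) {
        int pick = arbitrate(&mct, &gmc, true, last);     // flip to false to compare policies
        requester_t *r = pick == 0 ? &mct : &gmc;
        r->waiting--; r->outstanding++;
        printf("cycle %d: accepted from %s\n", cycle, pick == 0 ? "MCT" : "GMC");
        last = pick;
    }
    return 0;
}
```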