NUMA, or Non-Uniform Memory Access, lets hardware expose affinity between cores and memory controllers to software. NUMA nodes traditionally aligned with socket boundaries, but modern server chips can subdivide a socket into multiple NUMA nodes. That subdivision reflects how non-uniform interconnects become as core and memory controller counts keep going up. AMD designates their NUMA modes with the NPS (Nodes Per Socket) prefix.
NPS0 is a special NUMA mode that goes in the other direction. Rather than subdivide the system, NPS0 exposes a dual socket system as a single monolithic entity. It evenly distributes memory accesses across all memory controller channels, providing uniform memory access like in a desktop system. NPS0 and similar modes exist because optimizing for NUMA can be complicated and time intensive. Programmers have to specify a NUMA node for each memory allocation, and take care to minimize cross-node memory accesses. Each NUMA node only represents a fraction of system resources, so code pinned to a NUMA node will be constrained by that node’s CPU core count, memory bandwidth, and memory capacity. Effort spent getting an application to scale across NUMA nodes might be effort not spent on a software project’s other goals.
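To give a sense of what that bookkeeping looks like, here is a minimal sketch of NUMA-aware allocation using Linux’s libnuma. The target node and buffer size are arbitrary placeholders; real code would do something like this for every sizeable allocation and thread it cares about.

```c
// Minimal sketch of NUMA-aware allocation with libnuma (build with -lnuma).
// The target node and buffer size below are arbitrary placeholders.
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma reports no NUMA support\n");
        return 1;
    }

    int node = 0;                 // target NUMA node (placeholder)
    size_t size = 1ull << 30;     // 1 GiB buffer

    // Keep execution on the target node, then allocate memory backed by
    // that node so accesses stay node-local.
    numa_run_on_node(node);
    char *buf = numa_alloc_onnode(size, node);
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    memset(buf, 0, size);         // touch the pages so they're actually committed
    numa_free(buf, size);
    return 0;
}
```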
From AMD’s EPYC 9005 Series Architecture Overview, showing a dual socket Zen 5 (Turin) setup in NPS1 mode
Acknowledgements
A massive thank you goes to Verda (formerly DataCrunch) for providing an instance with 2 AMD EPYC 9575Fs and 8 Nvidia B200 GPUs. Verda gave us about 3 weeks with the instance to do with as we wished. While this article looks at the AMD EPYC 9575Fs, there will be upcoming coverage of the B200s found in the VM.
This system appears to be running in NPS0 mode, giving an opportunity to see how a modern server acts with 24 memory controllers providing uniform memory access.
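One quick way to see that from software (a sketch, not necessarily how it was checked here) is to ask libnuma how many nodes the operating system exposes; in NPS0, a dual socket machine shows up as a single node.

```c
// Report how many NUMA nodes the OS sees (build with -lnuma).
// Under NPS0, a dual socket system should report a single node.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("no NUMA support\n");
        return 1;
    }
    printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());
    return 0;
}
```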
A simple latency test immediately shows the cost of providing uniform memory access. DRAM latency rises to over 220 ns, giving a nearly 90 ns penalty over the EPYC 9355P running in NPS1 mode. It’s a high penalty compared to using the equivalent of NPS0 on older systems. For example, a dual socket Broadwell system has 75.8 ns of DRAM latency when each socket is treated as a NUMA node, and 104.6 ns with uniform memory access[1].
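For context, latency tests of this sort boil down to a dependent pointer chase through a buffer too big for the caches. The sketch below shows the basic idea; the actual test is more careful about page size, test sizes, and access patterns, and the buffer size and iteration count here are placeholders.

```c
// Simplified pointer-chasing latency test: build a single random cycle over a
// large buffer, then time dependent loads through it. Buffer size and
// iteration count are placeholders.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t count = (1ull << 30) / sizeof(size_t);   // ~1 GiB, well past the caches
    size_t iters = 20000000;                        // ~20M dependent loads

    size_t *chain = malloc(count * sizeof(size_t));
    if (!chain) return 1;

    // Build one random cycle (Sattolo's algorithm) so every load depends on
    // the previous one and hardware prefetchers can't follow the pattern.
    for (size_t i = 0; i < count; i++) chain[i] = i;
    srand(42);
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = chain[i]; chain[i] = chain[j]; chain[j] = tmp;
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    size_t pos = 0;
    for (size_t i = 0; i < iters; i++) pos = chain[pos];
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    // Printing 'pos' keeps the compiler from optimizing the chase away.
    printf("%.1f ns per load (final index %zu)\n", ns / iters, pos);
    free(chain);
    return 0;
}
```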
NPS0 mode does have a bandwidth advantage from bringing twice as many memory controllers into play. But the extra bandwidth doesn’t translate to a latency advantage until bandwidth demands reach nearly 400 GB/s. The EPYC 9355P seems to suffer when a latency test thread is mixed with bandwidth-heavy ones. A bandwidth test thread with just linear read patterns can achieve 479 GB/s in NPS1 mode. However, my bandwidth test produces low values on the EPYC 9575F because not all test threads finish at the same time. I avoid that problem in the loaded memory latency test by having the bandwidth load threads check a flag, which lets me stop all threads at approximately the same time.
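Roughly, that coordination can look like the sketch below: load threads stream through their buffers until an atomic stop flag goes up, so every thread winds down at about the same point. Thread count and buffer sizes are placeholders, and this is a simplified stand-in for the actual test code.

```c
// Sketch of flag-based coordination between a latency thread and bandwidth
// load threads (build with -pthread). Thread count and buffer sizes are
// placeholders.
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LOAD_THREADS 8
#define BUF_DOUBLES  (64ull * 1024 * 1024)   // 512 MiB per load thread

static atomic_bool stop;   // zero-initialized; load threads run until it's set

static void *bandwidth_worker(void *arg) {
    double *buf = arg;
    double acc = 0;
    uint64_t bytes = 0;
    // Stream through the buffer, checking the stop flag once per pass.
    while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
        for (size_t i = 0; i < BUF_DOUBLES; i++) acc += buf[i];
        bytes += BUF_DOUBLES * sizeof(double);
    }
    buf[0] = acc;   // keep the compiler from discarding the reads
    return (void *)(uintptr_t)(bytes >> 20);   // MiB read, for reporting
}

int main(void) {
    pthread_t threads[LOAD_THREADS];
    for (int i = 0; i < LOAD_THREADS; i++) {
        double *buf = calloc(BUF_DOUBLES, sizeof(double));
        if (!buf) return 1;
        pthread_create(&threads[i], NULL, bandwidth_worker, buf);
    }

    // ... run the pointer-chasing latency measurement here ...

    // Latency measurement done: raise the flag so all load threads stop at
    // approximately the same time instead of finishing at scattered points.
    atomic_store(&stop, true);

    for (int i = 0; i < LOAD_THREADS; i++) {
        void *mib;
        pthread_join(threads[i], &mib);
        printf("load thread %d read %zu MiB\n", i, (size_t)(uintptr_t)mib);
    }
    return 0;
}
```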
Per-CCD bandwidth is barely affected by the different NPS modes. Both the EPYC 9355P and 9575F use “GMI-Wide” links for their Core Complex Dies, or CCDs. GMI-Wide provides 64B/cycle of read and write bandwidth at the Infinity Fabric clock. On both chips, each CCD enjoys more bandwidth to the system compared to standard “GMI-Narrow” configurations. For reference, a GMI-Narrow setup running at a typical desktop 2 GHz FCLK would be limited to 64 GB/s of read and 32 GB/s of write bandwidth.
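As a sanity check on those numbers, per-link figures follow from bytes transferred per fabric clock cycle times the fabric clock. The per-cycle widths below for GMI-Narrow (32B read, 16B write) are inferred from the 64 GB/s and 32 GB/s figures at a 2 GHz FCLK, not taken from AMD documentation.

```c
// Back-of-the-envelope GMI link bandwidth: bytes per cycle times FCLK.
// The 32B/cycle read and 16B/cycle write widths for GMI-Narrow are inferred
// from the 64 GB/s and 32 GB/s figures quoted above at a 2 GHz FCLK.
#include <stdio.h>

int main(void) {
    double fclk_ghz = 2.0;             // typical desktop Infinity Fabric clock
    double read_b_per_cycle = 32.0;    // GMI-Narrow read width (inferred)
    double write_b_per_cycle = 16.0;   // GMI-Narrow write width (inferred)

    // B/cycle * Gcycles/s = GB/s
    printf("GMI-Narrow read:  %.0f GB/s\n", read_b_per_cycle * fclk_ghz);
    printf("GMI-Narrow write: %.0f GB/s\n", write_b_per_cycle * fclk_ghz);
    return 0;
}
```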
Performance: SPEC CPU2017