AMD's EPYC 9355P: Inside a 32 Core Zen 5 Server Chip

High core count chips are headline grabbers. But maxing out the metaphorical core count slider isn’t the only way to go. Server players like Intel, AMD, and Arm aim for scalable designs that cover a range of core counts. Not all applications can take advantage of the highest core count models in their lineups, and per-core performance still matters.

From AMD’s Turin whitepaper

AMD’s EPYC 9355P is a 32 core part. But rather than just being a lower core count part, the 9355P pulls levers to let each core count for more. First, it clocks up to 4.4 GHz. AMD has faster clocking chips in its server lineup, but 4.4 GHz is still a good bit higher than the 3.7 or 4.1 GHz that 128 or 192 core Zen 5 SKUs reach. Then, AMD uses eight CPU dies (CCDs) to house those 32 cores. Each CCD only has four cores enabled out of the eight physically present, but still has its full 32 MB of L3 cache usable. That provides a high cache capacity to core count ratio. Finally, each CCD connects to the IO die using a “GMI-Wide” setup, giving each CCD 64B/cycle of bandwidth to the rest of the system in both the read and write directions. GMI here stands for Global Memory Interconnect. Zen 5’s server IO die has 16 GMI links to support up to 16 CCDs for high core count parts, plus some xGMI (external) links to allow a dual socket setup. GMI-Wide uses two links per CCD, fully utilizing the IO die’s GMI links even though the EPYC 9355P only has eight CCDs.
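
For a rough sense of what GMI-Wide adds up to in absolute terms, the sketch below turns the 64B/cycle per-direction figure into GB/s. The 2 GHz Infinity Fabric clock (FCLK) used here is an assumed placeholder rather than a figure from AMD or from this article, so treat the output as illustrative arithmetic only.

/* Rough CCD <-> IO die bandwidth from the GMI-Wide figures above.
 * 64 B/cycle per direction per CCD is from the article; the 2.0 GHz
 * Infinity Fabric clock (FCLK) is an assumed placeholder value. */
#include <stdio.h>

int main(void) {
    const double bytes_per_cycle = 64.0;  /* per CCD, per direction, GMI-Wide */
    const double fclk_ghz = 2.0;          /* assumption, not from the article */
    const int ccds = 8;                   /* CCDs on the EPYC 9355P */

    double per_ccd = bytes_per_cycle * fclk_ghz;    /* GB/s, one direction */
    printf("Per CCD:    %.0f GB/s per direction\n", per_ccd);
    printf("All 8 CCDs: %.0f GB/s per direction\n", per_ccd * ccds);
    return 0;
}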

Acknowledgments

Dell has kindly provided a PowerEdge R6715 for testing, equipped with the aforementioned AMD EPYC 9355P along with 768 GB of DDR5-5200. The 12 memory controllers on the IO die give the chip a 768-bit memory bus, so the setup offers just under 500 GB/s of theoretical bandwidth. Besides offering a look into how a lower core count SKU behaves, the system comes with BMC access, which gives us an opportunity to investigate different NUMA setups.
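
That "just under 500 GB/s" figure is straightforward arithmetic: 12 channels of 64 bits each form a 768-bit bus, and every transfer moves 96 bytes. A minimal sketch of the calculation, using only numbers from the paragraph above:

/* The arithmetic behind "just under 500 GB/s": 12 DDR5 channels of 64 bits
 * each form a 768-bit bus running at 5200 MT/s. Figures are from the article. */
#include <stdio.h>

int main(void) {
    const double bus_bits = 12 * 64;      /* 768-bit memory bus */
    const double mega_transfers = 5200.0; /* DDR5-5200 */
    double gbs = (bus_bits / 8.0) * mega_transfers / 1000.0; /* bytes per transfer x GT/s */
    printf("Theoretical DRAM bandwidth: %.1f GB/s\n", gbs);  /* ~499.2 GB/s */
    return 0;
}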

The provided Dell PowerEdge R6715 system racked and ready for testing. Credit for the image goes to Zach from ZeroOne.

We’d also like to thank Zach and the rest of the fine folks at ZeroOne Technology for hosting the Dell PowerEdge R6715 at no cost to us.

Memory Subsystem and NUMA Characteristics

NPS1 mode, which exposes the whole socket as a single NUMA node, stripes memory accesses across all 12 of the chip’s memory controllers, presenting software with a monolithic view of memory at the cost of latency. DRAM latency in that mode is slightly better than what Intel’s Xeon 6 achieves in SNC3 mode, which divides the chip into three NUMA nodes corresponding to its compute dies. The EPYC 9355P has good memory latency in that respect, but it falls behind the Ryzen 9 9900X with DDR5-5600. Interconnects that tie more agents together tend to have higher latency, and on AMD’s server platform the Infinity Fabric network within the IO die has to connect up to 16 CCDs with 12 memory controllers and other IO, so the higher latency isn’t surprising.
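
Latency deltas between NUMA configurations like NPS1 are typically measured with a dependent-load (pointer chasing) pattern, where each load address comes from the previous load. The C sketch below illustrates that idea; it is not the test used for the numbers here, the 512 MB buffer and iteration count are arbitrary choices, and on Linux it would be pinned with numactl (for example numactl --cpunodebind=0 --membind=0) to compare local against remote memory.

/* Minimal pointer-chasing latency sketch (assumes Linux + gcc, compile with -O2).
 * Pin it with numactl to compare NUMA configurations, e.g.:
 *   numactl --cpunodebind=0 --membind=0 ./chase
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (512UL * 1024 * 1024)          /* well past any L3 */
#define ELEMS     (BUF_BYTES / sizeof(void *))
#define ITERS     (100UL * 1000 * 1000)

int main(void) {
    void **buf = malloc(BUF_BYTES);
    size_t *order = malloc(ELEMS * sizeof(size_t));
    if (!buf || !order) return 1;

    /* Shuffle the indices, then chain them into one big cycle so every
     * load address depends on the previous load and defeats prefetching. */
    for (size_t i = 0; i < ELEMS; i++) order[i] = i;
    srand(1);
    for (size_t i = ELEMS - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < ELEMS; i++)
        buf[order[i]] = &buf[order[(i + 1) % ELEMS]];
    free(order);

    /* Each iteration is a load whose address comes from the previous load,
     * so time per iteration approximates load-to-use latency. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = (void **)buf[0];
    for (size_t i = 0; i < ITERS; i++) p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per dependent load (%p)\n", ns / ITERS, (void *)p);
    return 0;
}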

Cache performance is similar across AMD’s desktop and server Zen 5 implementations, with the server variant only losing out because of lower clock speeds. That’s no surprise, because AMD reuses the same CCDs in desktop and server products. But it does create a contrast with Intel’s approach, where client and server memory subsystems differ starting at L3. Intel trades L3 latency for capacity and the ability to share a logical L3 instance across more cores.