GB10 is a collaboration between Nvidia and Mediatek that brings Nvidia’s Blackwell architecture into an integrated GPU. GB10’s GPU has 48 Blackwell SMs, matching the RTX 5070 in core count. The CPU side has 10 Cortex X925 and 10 Cortex A725 cores and is therefore quite powerful. Feeding all of that compute power requires a beefy memory subsystem, and can lead to difficult tradeoffs. Analyzing GB10’s memory subsystem from the CPU side will be the focus of this article. To keep article length manageable, I’ll further focus on Nvidia and Mediatek’s memory subsystems and design decisions. Core architecture and GB10’s GPU will be an exercise for another time.
Acknowledgments
We’d like to thank Zach at ZeroOne Technology for allowing us SSH access to his DGX Spark unit for CPU testing.
SOC Layout
CPU cores on GB10 are split into two clusters. Each cluster has five A725 cores and five X925 cores. Core numbering starts with the A725 cores within each cluster, and the two clusters come after each other. All of the A725 cores run at 2.8 GHz. X925 cores clock up to 3.9 GHz on the first cluster, and up to 4 GHz on the second.
Cache and Memory Access
Arm’s A725 and X925 have configurable cache capacities. GB10 opts for 64 KB L1 instruction and data caches on both cores. All A725 cores get 512 KB L2 caches, and all X925 cores get 2 MB of L2. A725’s L2 is 8-way set associative and offers latency at just 9 cycles. In actual time, that comes out to 3.2 nanoseconds and is good considering the low 2.8 GHz clock speed. However, L3 latency is poor at over 21 ns, or >60 cycles.
Testing cores across both clusters indicates that the first CPU cluster has 8 MB of L3, while the second has 16 MB. I’ll refer to these as Cluster 0 and Cluster 1 respectively. Both clusters have the same L3 latency from an A725 core, despite the capacity difference. 512 KB isn’t a lot of L2 capacity when L3 latency is this high. Likely, selecting the 512 KB L2 option reduces core area and lets GB10 implement more cores. Doing so makes sense considering that A725 cores aren’t meant to individually deliver high single threaded performance. That task is best left to the X925 cores.
GB10’s X925 cores have 2 MB, 8-way set associative L2 caches with 12 cycle latency. L3 latency is surprisingly much better at ~56 cycles or ~14 ns, even though the A725 and X925 cores share the same L3. While it’s not a spectacular L3 latency result, it’s at least similar to Intel’s Arrow Lake L3 in nanosecond terms. Combined with the larger L2, that gives GB10’s X925 cores a cache setup that’s better balanced to deliver high performance.
A 16 MB system level cache (SLC) sits after L3. It’s hard to see from latency plots due to its small capacity relative to the CPU L3 caches. Latency data from cluster 0 suggests SLC latency is around 42 or 47 ns, depending on whether it’s accessed from a X925 or A725 core respectively. System level caches aren’t tightly coupled to any compute block, which typically means lower performance in exchange for being able to service many blocks across the chip. Nvidia states that the system level cache “enables power-efficient data sharing between engines” in addition to serving as the CPU’s L4 cache. Allowing data exchange between the CPU and GPU without a round trip to DRAM may well be the SLC’s most important function.
... continue reading