
AMD GPUs Go Brrr


Team: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Fu, Ryan Swann, Muhammad Osama, Christopher Ré, Simran Arora

Links: Arxiv | Code

AI is compute-hungry. So we've been asking: How do we build AI from the hardware up? How do we lead AI developers to do what the hardware prefers?

AMD GPUs are now offering state-of-the-art speeds and feeds. However, this performance is locked away from AI workflows due to the lack of mature AMD software. We share HipKittens, an opinionated collection of programming primitives to help developers realize the hardware's capabilities: optimized register tiles, 8-wave and 4-wave kernel patterns instead of wave-specialization to schedule work within processors, and chiplet-optimized cache reuse patterns to schedule work across processors.

Check out part one of this series for an intro to HipKittens, and check out this post for a technical deep dive.

What do AMD CDNA GPUs look like? A lay of the land.

An AMD MI355X GPU has 256 processors called “compute units” (CUs), and each CU contains four SIMDs, each with its own execution units. A 64-thread “wave” (the counterpart of a 32-thread warp on NVIDIA) occupies a single SIMD. We show the MI355X memory hierarchy below.
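As a rough sketch of the scale this hierarchy implies (numbers taken from the text above; the one-wave-per-SIMD occupancy is an illustrative assumption, not a fixed property of the hardware):

```python
# MI355X compute hierarchy as described above.
CUS = 256          # compute units per GPU
SIMDS_PER_CU = 4   # SIMDs per CU
WAVE_SIZE = 64     # threads per wave (vs. 32-thread warps on NVIDIA)

simds = CUS * SIMDS_PER_CU   # total SIMDs on the GPU
lanes = simds * WAVE_SIZE    # concurrent lanes at one wave per SIMD
print(simds, lanes)          # 1024 SIMDs, 65536 lanes
```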

Unsurprisingly, making AMD GPUs go brr boils down to keeping the “matrix cores” (tensor cores on NVIDIA) fed. There are a few differences in how we think about this hardware:

What it's not. An MI355X has 70% of the SRAM of a B200 (165KB instead of 228KB), lacks asynchronous matrix multiplication instructions that operate on inputs in shared or tensor memory (wgmma, tcgen05), lacks register reallocation (the ability for some waves to give their registers to others), lacks tensor memory acceleration (dedicated hardware for global memory access), and lacks first-class mbarrier primitives (for fine-grained synchronization).

What it is. On the other hand, AMD GPUs have a 2x larger register file per processor than the B200 and offer 60% more processors per GPU (256 compute units versus 160 streaming multiprocessors). AMD offers tiny, fine-grained matrix core instructions, while NVIDIA tensor core instructions are generally called with large input operands. AMD has TMA-like direct global-to-shared memory loads via buffer_load_dword instructions, which bypass the register file.

Towards chiplet architectures. AMD is also leading the charge in the shift from monolithic dies to chiplets. AMD splits the 256 processors into 8 chiplets called “XCDs” of 32 CUs each. NVIDIA B200s include 2 chips. The AMD cache is disaggregated: each XCD has a private L2 cache, and an extra last-level cache (LLC) sits between the L2s and HBM memory.
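Because each XCD has a private L2, which blocks land on which chiplet determines cache reuse. One way to sketch the chiplet-aware scheduling idea is a block-index swizzle (the `remap_block` helper is hypothetical, and the round-robin dispatch of blocks across XCDs is an assumption of this sketch, not a guarantee from the text):

```python
NUM_XCDS = 8  # MI355X: 8 XCDs of 32 CUs each, each with a private L2

def remap_block(block_id: int, num_blocks: int) -> int:
    """Swizzle block IDs so that consecutive *logical* blocks run on the
    same XCD (sharing its private L2), assuming the hardware dispatches
    *physical* block IDs round-robin across XCDs."""
    blocks_per_xcd = num_blocks // NUM_XCDS  # assume divisibility for clarity
    xcd = block_id % NUM_XCDS    # chiplet this physical block lands on
    slot = block_id // NUM_XCDS  # its position within that chiplet
    return xcd * blocks_per_xcd + slot  # logical IDs contiguous per XCD

# With 16 blocks over 8 XCDs, physical blocks 0 and 8 both land on XCD 0
# and now process logical tiles 0 and 1, which are adjacent and can reuse L2.
print([remap_block(b, 16) for b in range(16)])
```

The same remap would run inside a kernel on the hardware block index; the point is only that neighboring tiles end up behind the same private L2.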

| Spec | NVIDIA B200 SXM5 | AMD MI355X OAM |
| --- | --- | --- |
| BF16 matrix / tensor | 2.2 PFLOPs | 2.5 PFLOPs |
| MXFP8 matrix / tensor | 4.5 PFLOPs | 5.0 PFLOPs |
| MXFP6 matrix / tensor | 4.5 PFLOPs | 10.1 PFLOPs |
| MXFP4 matrix / tensor | 9.0 PFLOPs | 10.1 PFLOPs |
| Memory capacity | 180 GB | 288 GB |
| Memory bandwidth | 8.0 TB/s | 8.0 TB/s |
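These speeds and feeds imply how much arithmetic a kernel must do per byte of HBM traffic before it becomes compute-bound. A back-of-envelope roofline calculation from the MI355X column (assuming peak numbers, which real kernels only approach):

```python
# Roofline break-even arithmetic intensity for the MI355X,
# using the peak BF16 and bandwidth figures from the table above.
bf16_flops = 2.5e15  # BF16 matrix throughput, FLOPs/s
hbm_bw = 8.0e12      # HBM bandwidth, bytes/s

intensity = bf16_flops / hbm_bw  # FLOPs per byte to keep matrix cores fed
print(intensity)                 # 312.5 FLOPs/byte
```

Anything below roughly 312 FLOPs per byte of HBM traffic is bandwidth-bound at BF16 peak, which is why the register-tile and cache-reuse patterns above matter.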
