HipKittens: Fast and furious AMD kernels

Team: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Fu, Ryan Swann, Muhammad Osama, Christopher Ré, Simran Arora

Links: Arxiv | Code

AI is gated by hardware. We think that opening up AI’s compute landscape is one of the most important problems to be working on right now. Building towards this goal, we present HipKittens: SoTA AMD kernels and a collection of opinionated programming primitives to make AMD kernel dev easier!

HipKittens is named after HIP, AMD's equivalent of CUDA.

Building towards multi-silicon AI systems

While AI has largely used a single hardware vendor to get to its current stage, AMD GPU hardware now offers state-of-the-art peak compute and memory bandwidth. However, this performance is locked away from AI workflows due to the lack of mature AMD software.

| Spec | NVIDIA B200 SXM5 | AMD MI355X OAM |
|---|---|---|
| BF16 matrix / tensor | 2.2 PFLOPs | 2.5 PFLOPs |
| MXFP8 matrix / tensor | 4.5 PFLOPs | 5.0 PFLOPs |
| MXFP6 matrix / tensor | 4.5 PFLOPs | 10.1 PFLOPs |
| MXFP4 matrix / tensor | 9.0 PFLOPs | 10.1 PFLOPs |
| Memory capacity | 180 GB | 288 GB |
| Memory bandwidth | 8.0 TB/s | 8.0 TB/s |

Table 1: Hardware overview. Peak memory and compute speeds for the latest generation GPU platforms.
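
To make the table easier to compare, here is a minimal sketch that derives two numbers from it: the MI355X-to-B200 peak compute ratio per format, and the roofline "ridge point" (peak FLOPs divided by peak bytes per second), which estimates how many FLOPs per byte a kernel needs before it stops being memory-bound. The roofline framing is a standard back-of-the-envelope model that we bring in for illustration, not something taken from the HipKittens post, and the names `SPECS`, `ridge_point`, and `speed_ratio` are hypothetical.

```python
# Peak specs copied from Table 1: compute in PFLOPs, bandwidth in TB/s.
SPECS = {
    "NVIDIA B200 SXM5": {
        "BF16": 2.2, "MXFP8": 4.5, "MXFP6": 4.5, "MXFP4": 9.0,
        "bandwidth_tbs": 8.0,
    },
    "AMD MI355X OAM": {
        "BF16": 2.5, "MXFP8": 5.0, "MXFP6": 10.1, "MXFP4": 10.1,
        "bandwidth_tbs": 8.0,
    },
}

FORMATS = ["BF16", "MXFP8", "MXFP6", "MXFP4"]


def ridge_point(peak_pflops: float, bandwidth_tbs: float) -> float:
    """Roofline ridge point in FLOPs per byte (assumed model, not from the post)."""
    return (peak_pflops * 1e15) / (bandwidth_tbs * 1e12)


def speed_ratio(fmt: str) -> float:
    """Peak compute of the MI355X relative to the B200 for one number format."""
    return SPECS["AMD MI355X OAM"][fmt] / SPECS["NVIDIA B200 SXM5"][fmt]


if __name__ == "__main__":
    for fmt in FORMATS:
        print(f"{fmt}: MI355X / B200 peak ratio = {speed_ratio(fmt):.2f}")
    for gpu, spec in SPECS.items():
        rp = ridge_point(spec["BF16"], spec["bandwidth_tbs"])
        print(f"{gpu}: BF16 ridge point = {rp:.1f} FLOPs/byte")
```

These are peak, datasheet-level figures only; achieved throughput depends on the kernel, which is the gap the rest of the post is about.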

The AMD software ecosystem includes AITER, a high-performance AI kernel library; PyTorch and a few compilers (Triton, Mojo, TileLang); and Composable Kernel (CK), AMD's C++-based programming model for writing kernels. However, despite gigawatt-scale AMD deployments, the software remains brittle.

The existing software offerings fail to consistently achieve peak performance. CK kernels frequently underperform (see our evaluations below). AITER and PyTorch are volatile; for instance, the AITER and PyTorch SDPA Llama GQA backwards kernels achieve just 30% and 24% of SoTA performance, respectively, on AMD MI355X GPUs. The compilers currently sacrifice significant performance and have not yet demonstrated reusable programming primitives for AMD. Further, we find that some critical aspects of hardware functionality around bank conflict avoidance are undocumented in the CDNA ISA; we discuss these in our technical deep dive blog.