Optimizing Matrix Multiplication on RDNA3
Published on: 2025-05-20 15:55:21
Introduction
Hi everyone!
In this post, I will walk through all the steps to write an optimized FP32 matrix multiplication on an AMD RDNA3 GPU that outperforms rocBLAS by 60%. I will cover some basics and explain every optimization I implemented. This will be done iteratively across 8 different kernels.
Figure 1: sneak peek of the performance results
I primarily worked on this to deepen my understanding of RDNA3 and to try out HIP, and I felt like I needed to share what I learned along the way :).
A few things I would like to say before we start:
All the information I used comes from the publicly available ISA guide
I don’t intend to re-implement or replace rocBLAS
I only focused on single-precision (FP32) multiplication of 4096x4096 matrices, for the sake of simplicity.
All my tests were done on Windows 11 with an AMD Radeon 7900 XTX.
That being said, let’s start!
Problem statement
There is a lot of research happening on ways to improve the performance of matrix m...