Optimizing Matrix Multiplication on RDNA3

Published on: 2025-05-20 15:55:21

Introduction

Hi everyone! In this post, I will share all the steps needed to write an optimized FP32 matrix multiplication on an AMD RDNA3 GPU that outperforms rocBLAS by 60%. I will cover some basics and explain all the optimizations I have implemented. This will be done iteratively, across 8 different kernels.

Figure 1: sneak peek of the performance results

I primarily intended to work on this to deepen my understanding of RDNA3 and to try out HIP, and I felt I needed to share what I learned along the way :).

A few things I would like to say before we start:

- All the information I used comes from the publicly available ISA guide.
- I don't intend to re-implement or replace rocBLAS.
- I only focused on single-precision (FP32) multiplication of 4096x4096 matrices, for the sake of simplicity.
- All my tests were done on Windows 11 with an AMD Radeon 7900 XTX.

That being said, let's start!

Problem statement

There is a lot of research happening on ways to improve the performance of matrix m ...