Published on: 2025-05-20 15:55:21
Introduction Hi everyone ! In this post, I will share with you all the steps to write an optimized FP32 matrix multiplication on AMD RDNA3 GPU outperforming rocBLAS by 60%. I will cover some basics and explain all the optimizations I have implemented. This will be done in a iterative way in 8 differents Kernels. Figure 1: sneak peek of the performance results I primary intended to work on this to deepen my understanding of RDNA3 and try out HIP and I felt like I needed to share what I learne
Keywords: figure kernel lds performance v_dual_fmac_f32
Find related items on AmazonGo K’awiil is a project by nerdhub.co that curates technology news from a variety of trusted sources. We built this site because, although news aggregation is incredibly useful, many platforms are cluttered with intrusive ads and heavy JavaScript that can make mobile browsing a hassle. By hand-selecting our favorite tech news outlets, we’ve created a cleaner, more mobile-friendly experience.
Your privacy is important to us. Go K’awiil does not use analytics tools such as Facebook Pixel or Google Analytics. The only tracking occurs through affiliate links to amazon.com, which are tagged with our Amazon affiliate code, helping us earn a small commission.
We are not currently offering ad space. However, if you’re interested in advertising with us, please get in touch at [email protected] and we’ll be happy to review your submission.