AMD RDNA 4 architecture GPUs
AMD RDNA™ 4 architecture GPUs feature 3rd-generation Matrix Cores, which improve the performance of General Matrix Multiplication (GEMM) operations. The table below compares theoretical FLOPS/clock/CU (floating-point operations per clock, per compute unit) with previous generations. However, the VGPR layout of the arguments to Wave Matrix Multiply Accumulate (WMMA) operations has changed from the previous RDNA 3 generation [1], so RDNA 4 is not backward compatible with RDNA 3 WMMA code. To accelerate GEMM operations on RDNA 4 GPUs, we need to use the new intrinsics added for this generation. In this post, we explain how to use the matrix cores on RDNA 4 GPUs from a HIP kernel.
| Data type | AMD Radeon™ RX 6950 XT (RDNA 2) | AMD Radeon™ RX 7900 XTX (RDNA 3) | AMD Radeon™ RX 9070 XT (RDNA 4) |
| --- | --- | --- | --- |
| FP16 | 256 | 512 | 1024 |
| BF16 | N/A | 512 | 2048 |
| I8 | 512 | 512 | 2048 |
Wave Matrix Multiply Accumulate (WMMA) on AMD RDNA 4
Before going into the details of the AMD RDNA 4 architecture implementation, let's review the matrix operation. A GEMM operation can be written as follows:
$$D = AB + C \tag{1}$$
WMMA operates only on matrices of dimension 16×16. Thus, if the matrix we are working on is smaller than that, it needs to be padded. If the matrix is larger than that, we can still use the WMMA intrinsics by decomposing the larger GEMM into GEMM operations on 16×16 tiles, as in the sketch below.
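To make the decomposition concrete, here is a scalar reference sketch in plain C++ (assuming M, N, and K are multiples of 16 and all matrices are stored row-major). Each iteration of the `kt` loop has exactly the 16×16×16 shape that a single WMMA operation consumes:

```cpp
// Scalar reference: decompose D = A*B + C into 16x16 tiles.
// Each 16x16 tile of D accumulates K/16 products of a 16x16 tile of A
// with a 16x16 tile of B, starting from the matching tile of C.
void gemmTiled(const float* A, const float* B, const float* C, float* D,
               int M, int N, int K) // all assumed to be multiples of 16
{
    for (int i = 0; i < M; i += 16)     // tile row of D
        for (int j = 0; j < N; j += 16) // tile column of D
        {
            float acc[16][16];
            for (int r = 0; r < 16; ++r)
                for (int c = 0; c < 16; ++c)
                    acc[r][c] = C[(i + r) * N + (j + c)];

            // One 16x16x16 sub-GEMM per kt iteration: this is the unit
            // of work a single WMMA operation performs.
            for (int kt = 0; kt < K; kt += 16)
                for (int r = 0; r < 16; ++r)
                    for (int c = 0; c < 16; ++c)
                        for (int k = 0; k < 16; ++k)
                            acc[r][c] += A[(i + r) * K + (kt + k)]
                                       * B[(kt + k) * N + (j + c)];

            for (int r = 0; r < 16; ++r)
                for (int c = 0; c < 16; ++c)
                    D[(i + r) * N + (j + c)] = acc[r][c];
        }
}
```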
This GEMM operation could be implemented naively in a kernel: a single lane of a wavefront would allocate the 16×16 matrices in VGPRs, load them, and then execute the operation. If this is done in all the lanes of a wavefront, it is obviously very inefficient, because they all load the same matrices and compute the same result. WMMA changes and simplifies this. The WMMA intrinsic differs from the other operations we write in a HIP kernel: instead of writing code for each lane of a wavefront, we write code that operates on the wavefront as a whole.
First, it changes the VGPR allocation. Instead of allocating the entire matrix in a single lane, each lane allocates a smaller set of VGPRs that stores only part of the matrix, which reduces VGPR pressure. Second, once the matrices are loaded into VGPRs, a WMMA intrinsic is called from all the lanes in a wavefront, which triggers the execution of the GEMM operation on the matrix core. The sketch below contrasts the two storage schemes.
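As an illustration, the following fragment contrasts the naive per-lane allocation with a WMMA fragment type. The vector width follows from the wave32 layout: a 16×16 matrix holds 256 elements, split across 32 lanes, so each lane stores 256 / 32 = 8 of them:

```cpp
// Naive scheme: every lane holds the full 16x16 FP16 matrix, i.e. 256
// VGPR-resident values per lane, all lanes duplicating the same data.
_Float16 aNaive[16 * 16];

// WMMA scheme (wave32): the 256 elements are distributed across the 32
// lanes of the wavefront, so each lane holds only an 8-element slice.
typedef _Float16 half8 __attribute__((ext_vector_type(8)));
half8 aFrag; // this lane's slice of the A matrix
```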
There are several WMMA intrinsics, but in this post we focus on the one that takes A and B as 16-bit floating-point numbers and C and D as 32-bit floating-point numbers: __builtin_amdgcn_wmma_f32_16x16x16_f16_w32_gfx12.
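Below is a minimal kernel sketch in which a single wavefront computes one 16×16 tile of D = AB + C. The fragment types match the builtin's per-lane signature on gfx12 targets: eight halves for A and B, and eight floats for C and D in wave32. The global-memory indexing is an illustrative assumption, not the authoritative mapping: A is stored row-major and B column-major, with lanes 0-15 holding k = 0..7 and lanes 16-31 holding k = 8..15 of their row or column; consult the RDNA 4 ISA documentation for the exact element-to-lane layout.

```cpp
#include <hip/hip_runtime.h>

// Per-lane fragment types matching the gfx12 builtin signature (wave32):
// 8 halves per lane for A/B, 8 floats per lane for C/D.
typedef _Float16 half8 __attribute__((ext_vector_type(8)));
typedef float float8 __attribute__((ext_vector_type(8)));

// One wavefront computes one 16x16 tile: D = A*B + C.
// The indexing below is an assumption for illustration: A row-major,
// B column-major, lanes 0-15 holding k = 0..7 and lanes 16-31 holding
// k = 8..15 of their row/column.
__global__ void wmmaF32_16x16x16F16(const _Float16* a, const _Float16* b,
                                    const float* c, float* d)
{
    const int lane  = threadIdx.x & 31; // lane id within the wave32 wavefront
    const int m     = lane & 15;        // row of A / column of B for this lane
    const int kBase = (lane >> 4) * 8;  // first of this lane's 8 K elements

    half8 aFrag, bFrag;
    float8 cFrag;

    for (int i = 0; i < 8; ++i)
    {
        aFrag[i] = a[m * 16 + kBase + i]; // A: row m, k = kBase..kBase+7
        bFrag[i] = b[m * 16 + kBase + i]; // B: column m (column-major)
        // Assumed C/D layout: lane covers column (lane % 16),
        // rows (lane / 16) * 8 + i of the 16x16 tile.
        cFrag[i] = c[(kBase + i) * 16 + m];
    }

    // All 32 lanes call the intrinsic together; the matrix core performs
    // the full 16x16x16 multiply-accumulate in one operation.
    float8 dFrag = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32_gfx12(
        aFrag, bFrag, cFrag);

    for (int i = 0; i < 8; ++i)
        d[(kBase + i) * 16 + m] = dFrag[i]; // write back this lane's D slice
}
```

Launched with a single wavefront (one block of 32 threads), this computes a single 16×16 tile; a full GEMM would assign many such tiles to many wavefronts, following the decomposition sketched earlier.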