
Multiplatform Matrix Multiplication Kernels


Few algorithmic problems are as central to modern computing as matrix multiplication. It is fundamental to AI, forming the basis of the fully connected layers used throughout neural networks. In transformer architectures, most of the computation is spent performing matrix multiplication. And since compute largely determines capability, faster matrix multiplication algorithms directly translate into more powerful models [1].

NVIDIA probably deserves much of the credit for making matrix multiplication fast. Their early focus on building GPUs for gaming drove major improvements in general-purpose linear algebra. As deep learning took over, they introduced Tensor Cores — specialized units designed to accelerate matrix multiplication in AI models. With hardware evolving to support these new workloads, accelerators have become increasingly complex, and writing high-performance code for them remains challenging.

To ease this burden, hardware vendors like NVIDIA, AMD and Intel provide off-the-shelf matrix multiplication kernels, typically integrated into tensor libraries. However, the quality and performance of these kernels vary. While NVIDIA's implementations, through libraries like cuBLAS [2] and cuDNN [3], are highly optimized, a key limitation remains: as precompiled binaries, they lack flexibility and extensibility.

In modern AI workloads, the primary bottleneck is no longer computation; it's data movement. Moving data from memory to registers often costs more than performing the actual calculations. The best way to optimize for this is obvious: minimize data movement. A powerful way to achieve this is to fuse multiple kernels or algorithms into a single kernel [4]. For matrix multiplication, this means executing element-wise operations on the computed values just before writing them to global memory. Pre-built matrix multiplication kernels don't support this kind of composition, making custom kernel implementations necessary.

This is why NVIDIA created CUTLASS [5]: a set of C++ templates that help developers tailor matrix multiplication kernels to their needs. However, it remains intricate to use and, of course, only works on NVIDIA GPUs. As a result, using CUTLASS to optimize a model often comes at the cost of portability. But wouldn't it be better to optimize models across platforms? That's why we built CubeCL [6], and on top of it, a matrix multiplication kernel engine [7] that searches for and generates optimized kernels for any GPU and even CPUs.

Not Just an Algorithm

One of the main challenges with matrix multiplication is that the input shapes strongly influence which algorithm performs best. As a result, no single algorithm is optimal in all cases; different shapes require different strategies. To address this, we built an engine that makes it easy to create fully configurable algorithms. At its core is a multi-level matrix multiplication architecture that we'll explore in this blog.

...Or, skip the theory and jump straight to the benchmarks.

A Bit on the Hardware First

Before diving into the solution, let's briefly review the architecture of modern GPUs. A GPU consists of multiple streaming multiprocessors (SMs), each capable of independently executing parts of a program in parallel. Each SM has a fixed set of resources shared across multiple concurrent executions. There are three main levels of execution granularity:

Unit (thread in CUDA, invocation in Vulkan/Wgpu): the smallest execution entity performing computations.
Plane (warp in CUDA, subgroup in Vulkan/Wgpu): a group of (typically 32) units executing in lockstep and able to share data efficiently through registers.
Cube (thread block in CUDA, workgroup in Vulkan/Wgpu): a group of units that execute on the same SM, sharing memory and able to synchronize.
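
To make the nesting concrete, here is a minimal Rust sketch (not CubeCL code) of the index arithmetic relating the three levels, assuming a plane size of 32 units; the names `execution_indices`, `cube_pos`, `cube_dim`, and `unit_pos` are illustrative, not part of any API.

```rust
// Illustrative index arithmetic relating the three execution levels,
// assuming a plane (warp/subgroup) size of 32 units.
const PLANE_SIZE: usize = 32;

/// For a unit at position `unit_pos` inside a cube of `cube_dim` units,
/// launched as cube number `cube_pos`, derive the plane index, the lane
/// inside that plane, and the unit's global position across all cubes.
fn execution_indices(cube_pos: usize, cube_dim: usize, unit_pos: usize) -> (usize, usize, usize) {
    let plane_pos = unit_pos / PLANE_SIZE; // which plane inside the cube
    let lane = unit_pos % PLANE_SIZE;      // position inside the plane
    let absolute_pos = cube_pos * cube_dim + unit_pos; // global unit index
    (plane_pos, lane, absolute_pos)
}

fn main() {
    // Unit 70 of cube 3, with 256 units per cube:
    let (plane, lane, global) = execution_indices(3, 256, 70);
    println!("plane {plane}, lane {lane}, global unit {global}"); // plane 2, lane 6, global unit 838
}
```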

Grasping the concept of planes can be challenging when first programming GPUs, because it's extremely implicit. One of the GPU's parallel axes is actually more efficient when executed in sync, but this is generally not reflected in the code. Units within the same plane perform best when they execute the same instruction simultaneously, whereas branching is more acceptable between different planes. Additionally, data access is faster when neighboring units load from consecutive addresses, allowing a single instruction to fetch a contiguous block of memory. This is known as coalesced memory access and is central to most memory optimizations. As a result, many high-performance computing kernels treat the plane—not the individual unit—as the smallest computation primitive, in order to avoid branching and maximize performance.

Memory Resources

Different Cubes cannot synchronize directly because they may run on different SMs (although starting with SM 9.0, SMs within a cluster can share memory, enabling limited inter-SM synchronization). However, a single SM can run multiple Cubes concurrently, depending on the resources each Cube requires. But what exactly are these resources?

Shared memory: memory visible to an entire Cube for storing intermediate values, with accesses that can be synchronized across the Cube. It is faster to access than global memory, but slower than registers. It is often used as a manually managed cache for temporary data reuse, or to facilitate communication between planes.
Registers: normally, an SM has a fixed total number of registers that are shared among multiple concurrent Cubes. Therefore, if your algorithm requires many registers per Cube, it limits the number of Cubes that can reside simultaneously on the same SM.

The fraction of the SM's capacity that is actively used by running Cubes is called occupancy. High occupancy generally helps the GPU hide memory latency by allowing the hardware to switch between active Cubes when one is stalled waiting for data. In other words, higher occupancy can help mask stalls caused by memory access delays. However, maximizing occupancy isn't always possible, nor is it always the best way to avoid stalls. Some optimization techniques trade off hardware scheduling for software-managed pipelining, using more registers to keep data flowing continuously and reduce the impact of memory stalls.
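
As a rough illustration of that register trade-off, here is a small sketch with purely hypothetical numbers; real limits also depend on shared memory usage, hardware caps, and the compiler's register allocation, and `max_resident_cubes` is not taken from any real device or API.

```rust
// Hypothetical register-budget arithmetic: the more registers a Cube needs,
// the fewer Cubes an SM can keep resident, which lowers occupancy.
fn max_resident_cubes(regs_per_sm: usize, regs_per_unit: usize, units_per_cube: usize) -> usize {
    let regs_per_cube = regs_per_unit * units_per_cube;
    regs_per_sm / regs_per_cube
}

fn main() {
    // e.g. 65,536 registers per SM, 128 registers per unit, 256 units per cube:
    // 128 * 256 = 32,768 registers per cube, so at most 2 cubes fit per SM.
    println!("{} resident cubes", max_resident_cubes(65_536, 128, 256));
}
```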

With these hardware considerations in mind, let's return to the problem at hand.

Matrix Multiplication Fundamentals

Problem Definition

Matrix multiplication takes two input tensors: Left-hand side (Lhs) of shape [b, m, k] and Right-hand side (Rhs) of shape [b, k, n]. For each batch b, the m rows of Lhs are multiplied with the n columns of Rhs along the shared k dimension, producing an output tensor of shape [b, m, n]. While tensors can have multiple batch dimensions that differ between Lhs and Rhs (resolved through broadcasting), we'll focus on the simpler case with a single shared batch dimension, without loss of generality.
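
As a quick sanity check of the shape rule, here is a minimal Rust sketch applied to the (384, 224, 64) example introduced in the figure below; `matmul_output_shape` is just our own illustrative helper.

```rust
// A minimal sketch of the shape rule: [b, m, k] x [b, k, n] -> [b, m, n].
fn matmul_output_shape(lhs: [usize; 3], rhs: [usize; 3]) -> [usize; 3] {
    let [b, m, k] = lhs;
    let [b_rhs, k_rhs, n] = rhs;
    assert_eq!(b, b_rhs, "batch dimensions must match (ignoring broadcasting)");
    assert_eq!(k, k_rhs, "shared k dimension must match");
    [b, m, n]
}

fn main() {
    // The (384, 224, 64) example: Lhs [1, 384, 64] x Rhs [1, 64, 224] -> [1, 384, 224].
    assert_eq!(matmul_output_shape([1, 384, 64], [1, 64, 224]), [1, 384, 224]);
}
```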

Matmul Problem. Representation of an (m, n, k) matrix multiplication problem in row-major layout. For every pair (i, j) in the output, we must compute the accumulated sum over k' in [0, k) of Lhs(i, k') × Rhs(k', j). In this blog post, we will use a (384, 224, 64)-Matmul as an example.

Let's set aside batching and focus on the core problem of computing the matrix product Lhs × Rhs with shapes [m, k] × [k, n], which we'll refer to as an (m, n, k)-Matmul.

A naive CPU implementation would iterate over the m and n dimensions, computing dot products along k for each (m, n) pair. As illustrated in the above figure, each output element results from a dot product (or inner product) involving k multiply-add operations. This process is repeated m × n times—once for each output element.
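
As a point of reference, here is a minimal, unoptimized Rust sketch of that dot-product formulation for a single (m, n, k)-Matmul on row-major slices; the function name and layout assumptions are ours, not the engine's.

```rust
// Naive dot-product formulation: one dot product of length k per output
// element, repeated m * n times. Matrices are row-major, batch omitted.
fn matmul_naive(lhs: &[f32], rhs: &[f32], out: &mut [f32], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                // Lhs(i, p) * Rhs(p, j), accumulated along the shared k dimension.
                acc += lhs[i * k + p] * rhs[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
}
```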

The same result can also be computed as a sum of k outer products. An outer product is what you get when you multiply a column vector by a row vector — it gives you all combinations of their elements, filling an entire matrix. This approach requires maintaining all m × n intermediate results (accumulators) at once, but avoids reloading the same data multiple times. Quite interestingly, we will actually use both strategies, depending on the level of tiling.
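
For contrast, here is the same computation written as a sum of k outer products, again as an unoptimized Rust sketch with names of our own choosing; on real hardware the benefit comes from keeping the m × n accumulators in registers, which this scalar version only hints at.

```rust
// Outer-product formulation: for each k', combine column k' of Lhs with
// row k' of Rhs and add the resulting m x n update into the accumulators.
// All m * n accumulators stay live for the whole computation.
fn matmul_outer_product(lhs: &[f32], rhs: &[f32], out: &mut [f32], m: usize, n: usize, k: usize) {
    out.iter_mut().for_each(|acc| *acc = 0.0);
    for p in 0..k {
        for i in 0..m {
            let a = lhs[i * k + p]; // element i of column p of Lhs, reused across j = 0..n
            for j in 0..n {
                out[i * n + j] += a * rhs[p * n + j];
            }
        }
    }
}
```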

Complexity

A matrix multiplication consists of m × n dot products, each involving vectors of length k. Each dot product requires k multiply-add operations, which modern hardware can often fuse into single instructions. Although the Strassen algorithm can theoretically reduce the number of operations, its practical implementations usually make poorer use of fused multiply-add instructions, making it less efficient in practice [8]. Therefore, we consider 2 × b × m × n × k as the fundamental operation count. Dividing this operation count by execution time yields the achieved throughput in TFLOPS, a measure we will use in the benchmarks section to compare our algorithms.
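
Concretely, the reported throughput follows from that operation count as shown below; `achieved_tflops` is just our own illustration of the formula, not part of the benchmark harness.

```rust
// Achieved throughput: (2 * b * m * n * k) floating-point operations
// divided by the measured execution time, expressed in TFLOPS.
fn achieved_tflops(b: usize, m: usize, n: usize, k: usize, seconds: f64) -> f64 {
    let flops = 2.0 * (b * m * n * k) as f64;
    flops / seconds / 1e12
}

fn main() {
    // The (384, 224, 64) example with a single batch amounts to
    // 2 * 384 * 224 * 64 = 11_010_048 operations; finishing it in one
    // microsecond would correspond to roughly 11 TFLOPS.
    println!("{:.2} TFLOPS", achieved_tflops(1, 384, 224, 64, 1e-6));
}
```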

This means that given optimal hardware instruction usage, there exists a theoretical performance ceiling. Our goal in optimizing Matmul at the software level is to approach this ceiling as closely as possible. In other words, matrix multiplication is compute-bound, and our challenge lies in minimizing or hiding global memory access latency.
