from Guide to Machine Learning on Apr 30, 2023
How to tile matrix multiplication
Matrix multiplication is a staple of deep learning and a well-studied, well-optimized operation. One of the most common optimizations for matrix multiplication is called "tiling," but as common and important as it is, it's a bit confusing to understand.
Tiling matrix multiplication is a valuable technique that optimizes resource utilization in multiple dimensions, including power, memory, and compute. Critically, tiling also reduces overall latency, making it vital for models that rely heavily on dense matrix multiplication.
One such example is transformers and their associated Large Language Models; their heavy reliance on dense matrix multiplies for inference makes tiling an important concept to understand — and to leverage.
Not sure why dense matrix multiplies are so necessary? For a primer on how Large Language Models work, check out the 3-part series, beginning with Language Intuition for Transformers.
In this post, we'll break down how tiling for matrix multiplication works, again by conveying intuition primarily through illustrations.
I'll start with a description of how to tile a single matrix multiply, covering only the most salient parts at a high level.
Let's multiply two matrices $A$ and $B$ normally. To do so, we take the inner product of each row of $A$ with each column of $B$. We illustrate this below.
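In code, this row-by-column process is the familiar triple loop. Here's a minimal Python sketch (the function name and example matrices are just for illustration):

```python
# Naive matrix multiply: each entry C[i][j] is the inner product
# of row i of A with column j of B.
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

Each iteration of the innermost loop performs one multiply-accumulate of the inner product; the memory access pattern of these three loops is exactly what tiling rearranges.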
Here's what that process looks like in more detail:
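As a preview of where we're headed, the same computation can be reorganized into tiles. This is a minimal sketch, not an optimized implementation; the tile size `T` and the loop structure are illustrative assumptions, and the payoff comes from each small block of $A$ and $B$ being reused while it still sits in fast memory:

```python
# Tiled matrix multiply: loop over T x T blocks of the output,
# accumulating the contribution of one tile of A and one tile of B
# at a time. Produces the same result as the naive triple loop.
def tiled_matmul(A, B, T=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            for p0 in range(0, k, T):
                # Multiply-accumulate one tile pair into a tile of C.
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, m)):
                        for p in range(p0, min(p0 + T, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tiled_matmul(A, B))  # [[19, 22], [43, 50]]
```

Note that only the loop order changes, never the arithmetic, which is why tiling preserves the exact result.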