
Unweaving warp specialization on modern tensor core GPUs


Recently, I have been thinking deeply about warp specialization in the context of high-performance kernels for modern Tensor Core GPUs like NVIDIA’s H100 and B200. My understanding of what warp specialization achieves has deepened, and it has led me to an interesting question: do we actually need warp specialization (and the complexity that it entails)? My conclusion is that the answer is indeed yes, but it may not be as mandatory as it seems. In this post, I’ll discuss when warp specialization is actually necessary and describe the underlying trade-off space that I believe it resides within. While I will give some context on GPUs as needed for the discussion, this won’t be a tutorial – some experience with GPUs and parallel programming is assumed.

Background

A GPU is a collection of processors called streaming multiprocessors (SM’s). For this discussion, we will focus on programming an individual SM. An SM is programmed with a hierarchy of threads, called a thread block. Threads in a thread block are further grouped into warps, which are groups of 32 threads. Each warp executes in a single-instruction-multiple-threads (SIMT) model. Each thread in a warp has its own instruction stream, and the warp issues one instruction on behalf of its threads in each issue slot. Performance is maximized (as discussed later) when all threads in a warp want to issue the same instruction at the same time. A Hopper SM (pictured below) has four execution contexts that can host an active warp, shown by the 4 quadrants.
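In CUDA code, the warp and lane that a thread belongs to fall directly out of its position within the thread block. A minimal sketch (the kernel name and output encoding are purely illustrative):

    __global__ void warp_layout_example(int *out) {
        int warp_id = threadIdx.x / 32;   // index of this thread's warp within the block
        int lane_id = threadIdx.x % 32;   // index of this thread within its warp
        // Record the (warp, lane) decomposition so it can be inspected from the host.
        out[blockIdx.x * blockDim.x + threadIdx.x] = warp_id * 100 + lane_id;
    }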

At any cycle, at most 4 warps may issue instructions into the SM to execute. When a thread block contains more than 4 warps’ worth of threads (128), a hardware component called the warp scheduler selects 4 of the available warps to issue instructions.

One way to view an SM is as a collection of functional units (arithmetic units, load/store units, a Tensor Core) that are issued instructions at each clock cycle from the 4 execution contexts. These functional units have varying properties. Arithmetic units (ALU’s) perform individual math operations with short, fixed cycle latencies; Tensor Cores perform thousands of FLOPs in a single instruction with long cycle latencies; and load/store units (LSU’s) have long and unpredictable latencies due to interacting with the memory system. High-performance GPU programs efficiently utilize the available functional units: compute-bound programs should use the Tensor Core and ALU’s at every clock cycle, while bandwidth-bound programs should keep the LSU’s busy to maximize bandwidth. To achieve high utilization, there must be work present for the functional units to perform (i.e. the floating-point operations in a compute-bound application should not be stalled waiting for loads to complete), and this available work must be issued to the functional units whenever they are available. This second aspect is where warp specialization becomes useful.

Warp Specialization

Warp specialization is a technique that was popularized by work on CUDA-DMA and the Singe compiler, and it is now a table-stakes technique for achieving high Tensor Core performance on Hopper and Blackwell GPUs. Warp specialization exploits the hierarchical grouping of threads within a thread block. When threads within the same warp diverge (i.e. take different paths through control flow), the SIMT nature of each warp results in performance degradation. Suppose that a warp reaches a branch where half the threads take the branch and the other half do not. The warp will now execute instructions from both sides of the branch; when the warp selects an instruction from one side of the branch, the threads executing the other side do not make progress. As a result, execution may take twice as long as if all threads in the warp took the same path through the branch. In the worst case, if all 32 threads in a warp take different control-flow paths, the code could execute 32 times slower than the ideal!

Unlike threads within a warp, different warps within a thread block execute independently on separate execution contexts, which means that divergence between warps carries no cost. Warp specialization uses this property to restructure GPU programs: a standard GPU program executes the same logic on each warp, while a warp-specialized program uses different warps to execute different components of the overall program. Let’s take a look at some of these warp specialization strategies in the aforementioned contexts.
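As a concrete warm-up, the contrast between intra-warp and inter-warp divergence looks like this (two illustrative kernels, not taken from any particular codebase; the first branches per thread and diverges within each warp, while the second branches only on the warp index):

    // Intra-warp divergence: even and odd lanes take different paths, so each warp
    // serializes the two sides of the branch.
    __global__ void divergent_kernel(float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)
            x[i] = 2.0f * x[i];
        else
            x[i] = x[i] + 1.0f;
    }

    // Warp-level specialization: the branch depends only on the warp index, so every
    // thread in a given warp takes the same path and no intra-warp divergence occurs.
    __global__ void specialized_kernel(float *x, float *y) {
        int warp_id = threadIdx.x / 32;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (warp_id == 0)
            x[i] = 2.0f * x[i];   // one role for warp 0
        else
            y[i] = y[i] + 1.0f;   // a different role for the remaining warps
    }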

The CUDA-DMA project proposed separating the loading of data from the GPU’s (slow) global memory into (fast) shared memory from the computation performed on that data once it is in shared memory. CUDA-DMA separated the warps of a thread block into memory-loading warps and compute warps; the loader warps issue loads and signal the compute warps when the loaded data is available.
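A heavily simplified sketch of this structure (an illustrative kernel, not CUDA-DMA’s actual code; it uses __syncthreads() as the hand-off, whereas CUDA-DMA uses finer-grained named barriers so the loader and compute warps can overlap work across multiple buffers):

    // An illustrative loader/compute split in the spirit of CUDA-DMA. Warp 0 stages
    // a tile from global into shared memory; the remaining warps compute on it.
    #define TILE 256

    __global__ void loader_compute(const float *in, float *out, int n_tiles) {
        __shared__ float tile[TILE];
        int warp_id = threadIdx.x / 32;
        int lane_id = threadIdx.x % 32;

        for (int t = 0; t < n_tiles; ++t) {
            if (warp_id == 0) {
                // Loader warp: each lane copies a strided piece of the tile.
                for (int i = lane_id; i < TILE; i += 32)
                    tile[i] = in[t * TILE + i];
            }
            __syncthreads();  // the tile is now visible to the compute warps

            if (warp_id != 0) {
                // Compute warps: consume the staged tile (here, a trivial computation).
                for (int i = threadIdx.x - 32; i < TILE; i += blockDim.x - 32)
                    out[t * TILE + i] = 2.0f * tile[i];
            }
            __syncthreads();  // the tile may now be overwritten on the next iteration
        }
    }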

The Singe compiler targeted the generation of efficient combustion chemistry kernels. For the purposes of this post, these kernels essentially looked like large data-parallel computations (i.e. apply some function \(f\) to each element of an array) with a catch: computing \(f\) requires a large amount of intermediate state (numerous temporary variables in the chemical formulae). A straightforward implementation of these kernels requires too many registers to store the intermediate state and spills values to the stack, which lowers performance significantly. The annoying bit here is that the SM’s register file has enough space to store all the temporaries. However, the architecture provides each thread with a fixed number of accessible registers (for example, 255 per thread on Hopper). Singe used warp specialization to bypass the register-per-thread limit by partitioning the computation of \(f\) onto different warps. Concretely, suppose \(f(x) = 1 + x + 2 \cdot x + x^2 + 8 \cdot x^3\). Assuming a small register-per-thread budget, a warp specialized implementation of \(f\) might place the computation of \(1 + x + 2\cdot x\) onto warp one, and place \(x^2 + 8\cdot x^3\) onto warp two; the two warps would then communicate to sum the intermediate values.
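As a toy illustration only (real Singe kernels partition hundreds of temporaries rather than three terms), a two-warp split of this \(f\) might look like the following, assuming a 64-thread block and communication of the partial results through shared memory:

    // A toy two-warp split of f(x) = 1 + x + 2x + x^2 + 8x^3, assuming a 64-thread
    // (two-warp) block. Each warp evaluates part of f, mimicking how Singe
    // partitions intermediate state across warps.
    __global__ void split_f(const float *xs, float *out) {
        __shared__ float partial[2][32];
        int warp_id = threadIdx.x / 32;
        int lane_id = threadIdx.x % 32;
        float x = xs[blockIdx.x * 32 + lane_id];   // both warps read the same 32 inputs

        if (warp_id == 0)
            partial[0][lane_id] = 1.0f + x + 2.0f * x;        // first group of terms
        else
            partial[1][lane_id] = x * x + 8.0f * x * x * x;   // second group of terms

        __syncthreads();  // both partial results are now available

        if (warp_id == 0)
            out[blockIdx.x * 32 + lane_id] = partial[0][lane_id] + partial[1][lane_id];
    }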

Finally, warp specialization is used in high-performance Tensor Core kernels targeting Hopper and Blackwell to interact with the accelerators within the SM. On these GPUs, the SM contains accelerators that perform matrix multiplication (the Tensor Core) and data movement to/from global memory (the Tensor Memory Accelerator, or TMA). These accelerators offer instructions to multiply tiles of data or to copy tiles of data to and from global memory. They are also asynchronous: work on the accelerator is launched by a single instruction, and a blocking “wait” operation must be issued before the results of the instruction can be used. Specialized warps are used on Hopper and Blackwell to issue either TMA copies or Tensor Core matrix multiplies. The TMA warp issues copies and notifies the Tensor Core warps when data is ready to be multiplied, and the Tensor Core warps notify the TMA warp when data has been consumed and the memory is free to reuse for more copies. This code looks something like:
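(In rough outline only: the helpers wait, signal, tma_copy_async, and mma_compute below are hypothetical placeholders for the real mbarrier, TMA, and asynchronous Tensor Core instructions, and details such as the epilogue and stage management are omitted.)

    // Hypothetical sketch of a warp-specialized mainloop: a small pool of shared-memory
    // stages is cycled between a TMA (producer) warp and the Tensor Core (consumer)
    // warps. The helpers are placeholders, not a real API.
    if (warp_id == TMA_WARP) {
        for (int k = 0; k < num_k_tiles; ++k) {
            wait(stage_free[k % STAGES]);             // a shared-memory stage has been consumed
            tma_copy_async(stage(k % STAGES), k);     // launch the async TMA copy into that stage
            signal(stage_full[k % STAGES]);           // tell the Tensor Core warps data is on the way
        }
    } else {  // Tensor Core warps
        for (int k = 0; k < num_k_tiles; ++k) {
            wait(stage_full[k % STAGES]);             // block until the tile has landed in shared memory
            mma_compute(acc, stage(k % STAGES));      // issue the asynchronous Tensor Core multiply
            signal(stage_free[k % STAGES]);           // hand the stage back to the TMA warp
        }
    }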
