TileIR Internals - GoKawiil

In this post, we’ll dig into how TileIR works, from how it generates instructions to what each of its passes does. We’ll trace how a Mixture-of-Experts (MoE) kernel written in CuTile is compiled down through cuda_tile → nv_tileaa → nv_tileas → NVVM → LLVM → SASS.

Here’s what to expect:

What is CuTile?: The tile-centric programming model

Running Example: An MoE kernel we’ll trace through every stage

The Dialects: From cuda_tile through nv_tileaa and nv_tileas to NVVM/LLVM

The Passes: TileIR passes, what they do and when they run

Based on CUDA 13.1. Some details are undocumented and may change in future releases.

What is CuTile?

CuTile is NVIDIA’s new “tile-centric” programming model for modern NVIDIA GPUs. The abstraction is powerful: the programmer thinks in terms of tiles rather than threads, while the compiler handles the complexity of coordinating hundreds of threads across fragmented data. A single CuTile line such as ct.mma(a, b, acc) may be lowered to many tensor core instructions.
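To make the "one line, many instructions" point concrete, here is a minimal sketch in plain Python. It is not the CuTile API and not what the compiler actually emits. The helper names tile_mma and fragment_mma are hypothetical, and the fragment shape m16n8k16 is an assumption borrowed from one of the shapes PTX mma instructions accept for 16-bit inputs; the shape a real compiler picks depends on the data type and target architecture. The sketch only counts how many fragment-sized multiply-accumulates one tile-level operation breaks into.

```python
# Illustrative only: plain Python, not the CuTile API.
# Assumed fragment shape (m16, n8, k16); the real choice depends on
# dtype and GPU architecture.
FRAG_M, FRAG_N, FRAG_K = 16, 8, 16


def fragment_mma(a, b, acc, m0, n0, k0):
    """Stand-in for one tensor core instruction (hypothetical helper).

    Accumulates the product of a FRAG_M x FRAG_K block of `a` and a
    FRAG_K x FRAG_N block of `b` into a FRAG_M x FRAG_N block of `acc`.
    """
    for i in range(m0, m0 + FRAG_M):
        for j in range(n0, n0 + FRAG_N):
            total = acc[i][j]
            for k in range(k0, k0 + FRAG_K):
                total += a[i][k] * b[k][j]
            acc[i][j] = total


def tile_mma(a, b, acc):
    """Tile-level acc += a @ b, decomposed into fragment-sized steps.

    Returns the number of fragment-level operations issued.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    assert m % FRAG_M == 0 and n % FRAG_N == 0 and k % FRAG_K == 0
    issued = 0
    for m0 in range(0, m, FRAG_M):
        for n0 in range(0, n, FRAG_N):
            for k0 in range(0, k, FRAG_K):
                fragment_mma(a, b, acc, m0, n0, k0)
                issued += 1
    return issued


if __name__ == "__main__":
    M, N, K = 64, 64, 32
    a = [[1.0] * K for _ in range(M)]
    b = [[1.0] * N for _ in range(K)]
    acc = [[0.0] * N for _ in range(M)]
    # (64 / 16) * (64 / 8) * (32 / 16) = 64 fragment-level operations
    print(tile_mma(a, b, acc))
```

Under these assumptions, a single 64×64×32 tile multiply already expands to 64 fragment operations, before accounting for how those fragments are distributed across the threads of each warp. That distribution is the part the tile model hides from the programmer.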

What is TileIR?
