Introduction
As a CUDA developer, you might not interact with Parallel Thread Execution (PTX) every day, but it is the fundamental layer between your CUDA code and the hardware. Understanding it is essential for deep performance analysis and for accessing the latest hardware features, sometimes long before they are exposed in C++. For example, the wgmma↗ instructions, which perform warpgroup-level matrix operations and are used in some of the most performant GEMM kernels, are available only through PTX.
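To make that concrete, here is a minimal sketch of how PTX instructions are reached from CUDA C++ via inline assembly. A real wgmma call needs warpgroup synchronization and matrix descriptors, so a simple add.s32 stands in to show the mechanism; the kernel and function names are illustrative, not from this post's repository.

```cuda
// Minimal sketch: issuing a raw PTX instruction from CUDA C++.
// add.s32 is a stand-in; wgmma is reached through the same
// inline-asm mechanism but requires far more setup.
__device__ int add_one(int v) {
    int r;
    // %0 is the output operand, %1 the input; "r" binds a 32-bit register.
    asm("add.s32 %0, %1, 1;" : "=r"(r) : "r"(v));
    return r;
}

__global__ void demo(int *data) {
    data[threadIdx.x] = add_one(data[threadIdx.x]);
}
```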
This post serves as a gentle introduction to PTX and its place in the CUDA ecosystem. We will set up a simple playground environment and walk through a complete kernel in PTX. My goal is not only to give you the foundation to use PTX, but also to share my mental model of how PTX fits into the CUDA landscape.
PTX and the CUDA ecosystem
Every processor has an instruction set architecture (ISA), which is the specific set of commands the hardware can execute. NVIDIA GPUs are no different; their native, hardware-specific ISA is called SASS (streaming assembly). However, the SASS for one GPU generation can be incompatible with another, meaning a program compiled for an older GPU might not run on a newer one. In other words, there is no forward compatibility. This is one of the problems that PTX solves. PTX is an ISA for a virtual machine: an abstract GPU that represents the common features of all NVIDIA hardware. When you compile your CUDA code, nvcc first lowers it to hardware-agnostic PTX; a tool called ptxas↗ then translates that PTX into the specific SASS for your target GPU. This two-stage design is a common pattern in modern compilers; the LLVM (Low Level Virtual Machine) project↗ is a well-known example of this architecture.
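To get a feel for what this virtual ISA looks like, consider a trivial kernel and, in comments, the kind of PTX nvcc emits for it. The PTX lines below are representative rather than exact; the actual output varies by compiler version.

```cuda
// Compiling this kernel to PTX (e.g., with `nvcc -ptx`) produces
// hardware-agnostic code; ptxas later lowers it to SASS for a real GPU.
__global__ void scale(float *x, float a) {
    x[threadIdx.x] *= a;
}
// Representative PTX for the kernel body:
//   ld.global.f32   %f1, [%rd4];     // load x[threadIdx.x]
//   mul.f32         %f2, %f1, %f3;   // multiply by a
//   st.global.f32   [%rd4], %f2;     // store the result
// Note the virtual registers (%f1, %rd4): PTX assumes an unlimited
// register file, and ptxas assigns physical registers when it
// generates SASS for a concrete GPU.
```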
We can take advantage of PTX's forward compatibility through just-in-time (JIT) compilation. You can embed the PTX code directly into your final executable (I will cover this later in the post). When your application runs on a new GPU for which it has no pre-compiled SASS, the NVIDIA driver acts as a JIT compiler, translating the embedded PTX into SASS for that GPU. It's important to note that this works in one direction only: PTX generated for compute_70 can run on any future GPU (8.x, 9.x, etc.), but it cannot run on an older 6.x GPU. This is different from the SASS binary itself, which follows much stricter rules and is generally compatible only with GPUs of the same major architecture version. Tools like Triton↗ rely on this mechanism: they generate PTX and leave the final, hardware-specific compilation to the driver. By default, nvcc includes both PTX and SASS in your executable, giving you both immediate performance and future compatibility.
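As a sketch of that JIT path, here is how a PTX string already in memory can be handed to the driver via the CUDA driver API. The function name and the assumption that the PTX arrives as a runtime-generated string (Triton-style) are mine; error handling is omitted for brevity.

```cpp
#include <cuda.h>

// Hypothetical helper: hand runtime-generated PTX to the driver,
// which JIT-compiles it into SASS for whichever GPU is present.
// This is what makes a PTX-only binary forward compatible.
CUmodule jit_load(const char *ptx_source) {
    CUmodule mod;
    cuModuleLoadData(&mod, ptx_source);
    return mod;
}
```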
The PTX playground
The best way to learn PTX is to see it in action. To do that, we need a simple environment that lets us write PTX code and see it run. I have created precisely that in this repository↗.
The repository contains two main files:
- add_kernel.ptx↗: the text file that contains the raw PTX instructions for our kernel.
- main.cu↗: a C++ program that runs on the host to load, run, and verify the result of our PTX kernel.
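In spirit, the host program does something like the following with the CUDA driver API. The entry-point name "add_kernel", the launch configuration, and the argument list here are assumptions for illustration, not the repository's exact code, and error checking is omitted.

```cpp
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Load the PTX file; the driver JIT-compiles it to SASS here.
    CUmodule mod;
    cuModuleLoad(&mod, "add_kernel.ptx");

    // "add_kernel" is an assumed entry-point name for illustration.
    CUfunction kernel;
    cuModuleGetFunction(&kernel, mod, "add_kernel");

    // Allocate one int, launch a single thread, and read the result back.
    CUdeviceptr d_out;
    cuMemAlloc(&d_out, sizeof(int));
    void *args[] = { &d_out };
    cuLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, nullptr, args, nullptr);

    int out = 0;
    cuMemcpyDtoH(&out, d_out, sizeof(int));
    printf("result: %d\n", out);

    cuMemFree(d_out);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```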