What happens when you run a CUDA kernel?

Here’s a simple CUDA program. It adds two vectors.

#include <cstdio>
#include <cstdlib>

__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n) c[i] = a[i] + b[i];                   // guard: never write past the end
}

int main() {
    int n = 1 << 20;  // a million floats (1,048,576)
    size_t bytes = n * sizeof(float);
    float *a = (float*)malloc(bytes), *b = (float*)malloc(bytes), *c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) a[i] = b[i] = 1.0f;

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    vadd<<<4096, 256>>>(da, db, dc, n);  // 4096 * 256 = n threads, one per float
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0]=%f c[n-1]=%f\n", c[0], c[n - 1]);
}

Compiled for an RTX 4090 and launched, it correctly works out that 1 + 1 = 2, a million times. (I didn’t check all of them.)

$ nvcc -arch=sm_89 -o vadd vadd.cu && ./vadd
c[0]=2.000000 c[n-1]=2.000000
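To make the index arithmetic in the kernel concrete, here is a minimal host-only C++ sketch that walks every (block, thread) pair in a serial loop, applies the same `i < n` guard, and then checks every element rather than just the first and last. The names `grid_size`, `emulate_vadd`, and `all_equal` are hypothetical helpers invented for this illustration. They are not part of CUDA, and a real GPU runs these threads in parallel rather than in a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of blocks needed so that grid * block >= n
// (ceiling division).
int grid_size(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// Hypothetical host-side emulation of vadd<<<grid, block>>>(a, b, c, n).
// block_idx, block, and thread_idx stand in for blockIdx.x, blockDim.x,
// and threadIdx.x. Returns how many threads passed the bounds guard.
std::size_t emulate_vadd(const std::vector<float>& a,
                         const std::vector<float>& b,
                         std::vector<float>& c,
                         int grid, int block) {
    const int n = static_cast<int>(c.size());
    std::size_t writes = 0;
    for (int block_idx = 0; block_idx < grid; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block; ++thread_idx) {
            int i = block_idx * block + thread_idx;  // same formula as the kernel
            if (i < n) {                             // same guard as the kernel
                c[i] = a[i] + b[i];
                ++writes;
            }
        }
    }
    return writes;
}

// Hypothetical helper: checks every element, not just c[0] and c[n-1].
bool all_equal(const std::vector<float>& v, float expected) {
    for (float x : v) {
        if (x != expected) return false;
    }
    return true;
}
```

With `n = 1 << 20` and a block size of 256, `grid_size` returns 4096, the grid used in the launch above, so the guard never rejects a thread. With `n = 1000`, four blocks supply 1024 threads and the guard discards the last 24.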

Telling you that involved tens of millions of CPU instructions, a couple of device files, nine hundred ioctls, and one memory-mapped doorbell register. In this post, we’ll follow this one kernel from the code down to the warps, and back up to the answer.

An aside: this post is an instance of the ‘legibility transition’ that agents have engendered. There really is very little about computers you can’t find out with curiosity and (machine-enhanced) persistence. There is an interesting discussion of the implications of legibility for what AI can help us to know here.

Compiling our program with nvcc

We ought to start with how to turn this CUDA program into something that the device can actually read. To do that we need a compiler. Really, we need many compilers.

nvcc is a driver program that runs several other compilers and combines their output. If you pass --keep it leaves the whole pipeline on disk for you to read:

$ nvcc --keep -arch=sm_89 -o vadd vadd.cu && ls
...
vadd.ptx              # device code as PTX (from cicc)
vadd.sm_89.cubin      # device code as SASS (from ptxas)
vadd.fatbin           # cubin + PTX, bundled (from fatbinary)
vadd.cudafe1.stub.c   # host launch stub + kernel registration
vadd.o                # final host object, fatbin embedded
...
