Accelerated Game of Life with CUDA / Triton

Let’s look at implementing Conway’s Game of Life using a graphics card. I want to experiment with different libraries and techniques, to see how to get the best performance. I’m going to start simple, and get increasingly complex as we dive in.

The Game Of Life is a simple cellular automata, so should be really amenable to GPU acceleration. The rules are simple: Each cell in the 2d grid is either alive or dead. At each step, count the alive neighbours of the cell (including diagonals). If the cell is alive, it remains alive if 2 or 3 neighbours are alive. Otherwise it dies. If the cell is dead, it comes ot life if exactly 3 neighbours are alive. These simple rules cause an amazing amount of emergent complexity which has been written about copiously elsewhere.

For simplicity, I’ll only consider N×N grids, and skip calculations on the boundary. I ran everything with an A40, and I’ll benchmark performance at N=216 . For now, we’ll store each cell as 1 byte so this array is which equates to 4 GB of data.

All code is shared in the GitHub repo.

Theoretical Limit

An A40 has a memory bandwidth of 696 GB/s. That is the speed of transferring data from DRAM to the cores that actually perform calculation. To calculate the cell, the GPU needs to load at least one byte and store one byte when done. So, considering nothing but memory transfer, we need at least 4 GB * 2 / 696 GB/s = 11.5 ms.

As the game’s rules are relatively simple, we’re rarely going to be able to saturate the compute capabilities of the GPU, which is the other common limiting factor. So 11.5 ms is the target we want to get as close as possible to.

In Pytorch

Pytorch is a popular Python library for Machine Learning that comes with a useful set of operations. Torch has built-in support for convolutions, but they are designed for floating-point and include unneeded multiplications. Instead I use a separable box blur to compute the number of alive cells in each 3×3 rectangle. From there some simple logic covers the game of life rules.

def gol_torch_sum(x: torch.Tensor) -> torch.Tensor: y = x[2:] + x[1:-1] + x[:-2] z = y[:, 2:] + y[:, 1:-1] + y[:, :-2] z = torch.nn.functional.pad(z, (1, 1, 1, 1), value=0) return ((x == 1) & (z == 4)) | (z == 3).to(torch.int8)

... continue reading