A Tiny Compiler for Data-Parallel Kernels

A lot of fast code starts as a boring loop.

Modern hardware can perform the same operation on multiple values at once (e.g. SIMD and SIMT). Sometimes we write code directly for those execution models, but other times a compiler starts with regular-looking code and rewrites it so multiple loop iterations can run together. I built a tiny compiler (~180 lines of Python) to understand what that transformation looks like.

My compiler lowers kernels (rewrites them into a simpler, more explicit form where data parallelism is visible). The input is a small hand-written AST, and the output is a lowered IR that I print as Python-like code. It does not go all the way from source code to machine instructions; think of it as an intermediate step in a larger compiler.

Let's take a look at an example. Scaling audio is easy to parallelize, but it is still common to write it as an ordinary sequential loop like this:

    kernel scale_audio(samples, out, n, volume):
        for i in range(n):
            out[i] = samples[i] * volume

My compiler turns it into:

    kernel scale_audio(samples, out, n, volume):
        vector_for base in range(0, n, LANES):
            let i = (base + lane_id)
            let active = (i < n)
            masked_store(out, i, (masked_load(samples, i, active) * volume), active)
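One way a rewrite like this could be produced is sketched below. This is not the actual compiler: the node classes (`Load`, `Mul`, `Store`, `For`) and the functions `lower_expr` and `lower_for` are hypothetical names invented for illustration, and the sketch only handles a single flat loop whose body consists of stores.

```python
from dataclasses import dataclass

# Hypothetical AST nodes; the real compiler's node types may differ.
@dataclass
class Load:            # array[index]
    array: str
    index: str

@dataclass
class Mul:             # left * right
    left: object
    right: object

@dataclass
class Store:           # array[index] = value
    array: str
    index: str
    value: object

@dataclass
class For:             # for var in range(limit): body
    var: str
    limit: str
    body: list

def lower_expr(expr, mask):
    """Render an expression as text, turning loads into masked loads."""
    if isinstance(expr, Load):
        return f"masked_load({expr.array}, {expr.index}, {mask})"
    if isinstance(expr, Mul):
        return f"({lower_expr(expr.left, mask)} * {lower_expr(expr.right, mask)})"
    return str(expr)   # a plain variable name such as "volume"

def lower_for(loop):
    """Rewrite a scalar loop into vector_for form, returned as lines of text."""
    lines = [
        f"vector_for base in range(0, {loop.limit}, LANES):",
        f"    let {loop.var} = (base + lane_id)",
        f"    let active = ({loop.var} < {loop.limit})",
    ]
    for stmt in loop.body:   # assumption: every statement is a Store
        value = lower_expr(stmt.value, "active")
        lines.append(f"    masked_store({stmt.array}, {stmt.index}, {value}, active)")
    return lines

# The body of scale_audio, written out by hand as an AST.
loop = For("i", "n", [Store("out", "i", Mul(Load("samples", "i"), "volume"))])
lowered = lower_for(loop)
print("\n".join(lowered))
```

The point of the sketch is that the rewrite is mechanical: the loop variable becomes `base + lane_id`, the loop bound becomes a mask, and every memory access is routed through a masked operation.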

The goal is to replace for loops with vector_for loops, which allow multiple iterations of a loop to be executed in parallel. Each position in that grouped execution is called a lane.
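To check that the lowered form computes the same result as the original loop, it can be emulated in plain Python. The sketch below makes several assumptions: `LANES = 4` is an arbitrary width, the lanes are run one after another in an inner loop (real hardware would run them together), and the bodies of `masked_load` and `masked_store` are my guess at their intended meaning, not definitions taken from the compiler.

```python
LANES = 4  # assumed vector width, chosen only for this example

def masked_load(array, i, active):
    # Inactive lanes must not touch memory; return a placeholder instead.
    return array[i] if active else 0.0

def masked_store(array, i, value, active):
    # Only active lanes write their result.
    if active:
        array[i] = value

def scale_audio_scalar(samples, out, n, volume):
    for i in range(n):
        out[i] = samples[i] * volume

def scale_audio_lowered(samples, out, n, volume):
    for base in range(0, n, LANES):      # one vector_for step
        for lane_id in range(LANES):     # emulates the lanes sequentially
            i = base + lane_id
            active = i < n
            masked_store(out, i, masked_load(samples, i, active) * volume, active)

samples = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6]  # n = 6 is not a multiple of LANES
expected = [0.0] * 6
actual = [0.0] * 6
scale_audio_scalar(samples, expected, 6, 0.5)
scale_audio_lowered(samples, actual, 6, 0.5)
print(actual == expected)
```

In the second step of the lowered loop, `i` reaches 6 and 7, which are past the end of the arrays. The `active` mask is what keeps those lanes from reading or writing out of bounds.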

Lanes and Masks

A lane is one independent element position within a grouped execution. For example, if a grouped operation handles four values at once, it has four lanes:
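A small sketch can make the lane positions concrete. It assumes `LANES = 4` to match the four-lane example, and a hypothetical element count `n = 10` that is deliberately not a multiple of four; `lane_view` is a helper name made up for this illustration.

```python
LANES = 4   # assumed width, matching the four-lane example in the text
n = 10      # hypothetical element count, not a multiple of LANES

def lane_view(base):
    """Return (index, active) for every lane of the group starting at base."""
    return [(base + lane_id, base + lane_id < n) for lane_id in range(LANES)]

for base in range(0, n, LANES):
    print(base, lane_view(base))
```

For the last group, starting at `base = 8`, lanes 0 and 1 map to the valid indices 8 and 9, while lanes 2 and 3 map to 10 and 11 and are marked inactive. That per-lane flag is the mask.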
