
Async/Await on the GPU


At VectorWare, we are building the first GPU-native software company. Today, we are excited to announce that we can successfully use Rust's Future trait and async/await on the GPU. This milestone marks a significant step towards our vision of enabling developers to write complex, high-performance applications that leverage the full power of GPU hardware using familiar Rust abstractions.
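For readers less familiar with Rust's async machinery: an async function desugars to a state machine implementing the Future trait, which an executor drives to completion by calling poll. Here is a minimal CPU-side sketch using only the standard library; the busy-polling block_on executor below is purely illustrative (it is not VectorWare's GPU runtime), but it shows the trait contract involved:

```rust
use std::future::Future;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A no-op waker: sufficient for a toy executor that just polls in a loop.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Repeatedly poll a future until it yields a value.
fn block_on<F: Future>(mut fut: F) -> F::Output {
    // Safe to pin here: the future lives on this stack frame and is never moved again.
    let mut fut = unsafe { std::pin::Pin::new_unchecked(&mut fut) };
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

// `async fn` compiles down to an anonymous type implementing `Future`.
async fn double(x: u32) -> u32 {
    x * 2
}

fn main() {
    let result = block_on(async { double(21).await });
    assert_eq!(result, 42);
    println!("{result}");
}
```

The same Future/poll contract is what makes async portable: any environment that can poll a state machine can, in principle, run async Rust.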

Concurrent programming on the GPU

GPU programming traditionally focuses on data parallelism. A developer writes a single operation and the GPU runs that operation in parallel across different parts of the data.

fn conceptual_gpu_kernel(data) {
    // All threads in all warps do the same thing to different parts of data
    data[thread_id] = data[thread_id] * 2;
}

This model works well for standalone and uniform tasks such as graphics rendering, matrix multiplication, and image processing.

As GPU programs grow more sophisticated, developers use warp specialization to introduce more complex control flow and dynamic behavior. With warp specialization, different parts of the GPU run different parts of the program concurrently.

fn conceptual_gpu_kernel(data) {
    let communication = ...;
    if warp == 0 {
        // Have warp 0 load data from main memory
        load(data, communication);
    } else if warp == 1 {
        // Have warp 1 compute A on loaded data and forward it to B
        compute_A(communication);
    } else {
        // Have warps 2 and 3 compute B on loaded data and store it
        compute_B(communication, data);
    }
}

Warp specialization shifts GPU logic from uniform data parallelism to explicit task-based parallelism. This enables more sophisticated programs that make better use of the hardware. For example, one warp can load data from memory while another performs computations to improve utilization of both compute and memory.

This added expressiveness comes at a cost. Developers must manually manage concurrency and synchronization because there is no language or runtime support for doing so. Similar to threading and synchronization on the CPU, this is error-prone and difficult to reason about.

Better concurrent programming on the GPU
