Show HN: cuTile Rust: Safe, data-race-free GPU kernels in Rust

cuTile Rust ( cutile-rs ) is a tile-based system for writing memory-safe, data-race-free GPU kernels in idiomatic Rust. It extends Rust's ownership discipline across the GPU launch boundary: mutable tensors are partitioned into disjoint pieces before launch, immutable tensors are shared, and generated launchers preserve ownership while GPU work is in flight. The same model supports synchronous launches, asynchronous pipelines, and CUDA graph replay. The #[cutile::module] macro embeds a captured Rust AST for each kernel in the host binary; when a kernel is needed, cuTile Rust JIT-compiles that AST through CUDA Tile IR into a GPU cubin. Local opt-outs remain available when lower-level control is needed.

Project Status

We are excited to release this research project as a demonstration of how GPU programming can be made available in the Rust ecosystem. The software is in an early stage and under active development: you should expect bugs, incomplete features, and API breakage as we work to improve it. That being said, we hope you'll be interested to try it in your work and help shape its direction by providing feedback on your experience.

Please check out CONTRIBUTING.md if you're interested in contributing.

Quick Start

use cutile :: prelude :: * ; # [ cutile :: module ] mod kernel { use cutile :: core :: * ; # [ cutile :: entry ( ) ] fn add < const B : i32 > ( z : & mut Tensor < f32 , { [ B ] } > , x : & Tensor < f32 , { [ - 1 ] } > , y : & Tensor < f32 , { [ - 1 ] } > , ) { let tx = load_tile_like ( x , z ) ; let ty = load_tile_like ( y , z ) ; z . store ( tx + ty ) ; } } fn main ( ) -> Result < ( ) , Error > { let x = api :: ones :: < f32 > ( & [ 1024 ] ) ; let y = api :: ones :: < f32 > ( & [ 1024 ] ) ; let z = api :: zeros :: < f32 > ( & [ 1024 ] ) . partition ( [ 128 ] ) ; let ( _z , _x , _y ) = kernel :: add ( z , x , y ) . sync ( ) ? ; Ok ( ( ) ) }

The #[cutile::module] macro transforms add into a GPU kernel and generates a host-side launcher. The host code constructs lazy tensor operations, partitions the mutable output into 128-element chunks, and calls .sync() to JIT-compile and execute the kernel.

The kernel signature carries the access discipline into device code: z is the exclusive mutable output, while x and y are shared read-only inputs. The body loads input tiles matching the output partition, adds them, and stores the result. The launch grid (8, 1, 1) is inferred from the partition: 1024÷128 = 8 tiles.

Run a similar example via cargo run -p cutile-examples --example saxpy .

. More kernels and usage examples of the host-side API can be found here.

... continue reading