Zero-Copy GPU Inference from WebAssembly on Apple Silicon

tl;dr: on Apple Silicon, a WebAssembly module's linear memory can be shared directly with the GPU: no copies, no serialization, no intermediate buffers. The CPU and GPU read and write the same physical bytes. End-to-end, it works: a Wasm guest fills a matrix in its linear memory, the GPU reads it, computes, writes back, and the guest sees the result through the same pointer, same memory, zero copies.

Normally Wasm and GPUs are separated by an expensive serialization boundary: on most hardware, getting data from a VM sandbox to an accelerator means copying across a bus. Apple Silicon's Unified Memory Architecture erases that boundary (no PCIe bus between CPU and GPU, same physical memory), and what falls out is a runtime where Wasm is the control plane and the GPU is the compute plane, with near-zero overhead between them.

I'm building something called Driftwood that exploits this for stateful AI inference ... and this post is about the foundation (how the zero-copy chain works, what I measured, what it opens up). Still early, still poking at it.

Why this is normally hard

Quick background, for anyone who doesn't live in this stack: WebAssembly gives you a sandbox. Your module gets a flat byte array (linear memory) and that's the universe ... everything outside is mediated by "host" function calls. The whole point is isolation, portability, determinism.

GPUs also want a flat byte array, but a specific kind: page-aligned, pinned, accessible to the DMA engine. On a discrete GPU (think NVIDIA, or AMD), that memory sits across a PCIe bus from the CPU, so getting data from a Wasm module's linear memory to the GPU means: copy out of the sandbox into host memory, then copy across the bus into GPU memory. Two copies, two latency hits, and an awkward impedance mismatch between "isolated VM" and "hardware accelerator."
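To make the two-copy path concrete, here is a toy model in plain Rust. No GPU or Wasm runtime is involved: `LinearMemory`, `Upload`, and `upload_with_copies` are hypothetical names invented for this sketch, and the second `Vec` only stands in for memory on the far side of the bus.

```rust
/// Hypothetical stand-in for a Wasm guest's linear memory: one flat byte array.
pub struct LinearMemory {
    pub bytes: Vec<u8>,
}

/// Result of a staged upload: the "device" copy plus how many bytes were moved.
pub struct Upload {
    pub device: Vec<u8>,
    pub bytes_copied: usize,
}

/// Discrete-GPU-style path: sandbox -> host staging buffer -> device memory.
/// Returns None if the requested range falls outside linear memory.
pub fn upload_with_copies(mem: &LinearMemory, offset: usize, len: usize) -> Option<Upload> {
    let end = offset.checked_add(len)?;
    let src = mem.bytes.get(offset..end)?;

    // Copy 1: out of the sandbox into a host-side staging buffer.
    let staging: Vec<u8> = src.to_vec();

    // Copy 2: stands in for the transfer across the bus into device memory.
    let device: Vec<u8> = staging.clone();

    Some(Upload {
        device,
        bytes_copied: 2 * len,
    })
}
```

Uploading an 8-byte range this way moves 16 bytes, and the same doubling applies again on the way back if the guest wants to see the result.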

Apple Silicon changes the physics. The CPU and GPU share the same physical memory (Apple's Unified Memory Architecture) ... no PCIe bus in between! A pointer the CPU can read, the GPU can read too, out of the same DRAM. The real question: can you thread that pointer through the layers of abstraction (the Wasm runtime, the GPU API) without anyone making a defensive copy along the way?

Turns out ... you can!
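Two things have to hold for that to work: the region has to meet the GPU's page-alignment requirement, and whoever consumes it has to work on a view of the bytes and not a copy. A minimal Rust sketch of both, under stated assumptions: the 16 KiB host page size is hard-coded as an assumption for Apple Silicon (real code should ask the OS), the 64 KiB Wasm page size comes from the core spec, and `is_page_aligned` and `scale_in_place` are hypothetical helpers, with the latter standing in for a GPU kernel.

```rust
/// Wasm linear memory grows in 64 KiB pages (fixed by the core spec).
pub const WASM_PAGE: usize = 64 * 1024;

/// ASSUMPTION: 16 KiB host pages, as on Apple Silicon macOS.
/// Real code should query the OS for the page size.
pub const HOST_PAGE: usize = 16 * 1024;

/// True if a non-empty region starts on a page boundary and its length
/// is a whole number of pages.
pub fn is_page_aligned(addr: usize, len: usize, page: usize) -> bool {
    page.is_power_of_two() && len > 0 && addr % page == 0 && len % page == 0
}

/// Stand-in for a GPU kernel: scales every element in place through a
/// borrowed view, so the caller sees the result in its own buffer.
pub fn scale_in_place(shared: &mut [f32], factor: f32) {
    for x in shared.iter_mut() {
        *x *= factor;
    }
}
```

Because 64 KiB is a multiple of 16 KiB, the length of a linear memory is always a whole number of host pages under this assumption; the base address is the part that depends on how the runtime allocated it.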

The three-link chain

Three links. I validated each one on its own before trying to compose them: it's the kind of thing where, if you skip the isolation step and the whole pipeline breaks, you have no idea which joint is leaking.
