Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels

Speeding up PyTorch inference by 87% on Apple devices with AI-generated Metal kernels

tl;dr: Our lab investigated whether frontier models can write optimized GPU kernels for Apple devices to speed up inference. We found that they can: our AI-generated Metal kernels were 1.87x faster across 215 PyTorch modules, with some workloads running hundreds of times faster than baseline.

Why use AI to generate kernels for Apple devices?

AI models execute on hardware via GPU kernels that define each operation. The efficiency of those kernels determines how fast models run (in training and inference). Kernel optimizations like FlashAttention1 show dramatic speedups over baseline, underscoring the need for performant kernels.

While PyTorch and tools like torch.compile 2 handle some kernel optimizations, the last mile of performance still depends on handtuned kernels. These kernels are difficult to write, requiring significant time and expertise. It gets especially challenging when writing kernels outside of CUDA: expertise in non-CUDA platforms is rarer, and there is less tooling and documentation available

We set out to answer a simple question: could frontier models implement kernel optimizations automatically, across different backends? Billions of Apple devices rely on Metal kernels that are often under-optimized, so we started with Metal.

Our vision: Autonomous kernel optimization for any target platform using frontier models.

Across 215 PyTorch modules, our results show the generated kernels ran 87% faster on Apple hardware compared to baseline PyTorch. This approach requires no expertise in kernel engineering and can be done nearly instantly.

Here's a preview of what we discovered:

Many cases where our approach improved performance by 10-100X

... continue reading