
Compiling LLMs into a MegaKernel: A path to low-latency inference


One of the most effective ways to reduce latency in LLM inference is to fuse all computation and communication into a single megakernel — also known as a persistent kernel. In this design, the system launches just one GPU kernel to execute the entire model — from layer-by-layer computation to inter-GPU communication — without interruption. This approach offers several key performance advantages:

- Eliminates kernel launch overhead, even in multi-GPU settings, by avoiding repeated kernel invocations;
- Enables software pipelining across layers, allowing the kernel to begin loading data for the next layer while computing the current one;
- Overlaps computation and communication, as a megakernel can simultaneously execute compute operations and inter-GPU communication to hide latency.
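To make the first of these concrete, here is a minimal CUDA sketch of the persistent-kernel idea — not MPK's generated code. Instead of one launch per layer, a single cooperative launch iterates over all layers, with a grid-wide barrier standing in for the implicit synchronization between back-to-back launches. The per-layer math is a stub elementwise update.

```cuda
// megakernel_sketch.cu -- minimal illustration of the persistent-kernel idea.
// The per-layer math is a stand-in elementwise update; this is not MPK's
// generated code. Compile with: nvcc -arch=sm_70 -rdc=true megakernel_sketch.cu
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int N = 1 << 20;
constexpr int NUM_LAYERS = 32;

// Conventional design: one kernel launch per layer, each paying launch overhead.
__global__ void layer_kernel(float* x, int layer) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) x[i] = x[i] * 0.99f + layer;  // stand-in for the real layer math
}

// Persistent design: a single launch runs every layer; a grid-wide barrier
// replaces the implicit synchronization between back-to-back launches.
__global__ void megakernel(float* x) {
    cg::grid_group grid = cg::this_grid();
    for (int layer = 0; layer < NUM_LAYERS; ++layer) {
        for (unsigned long long i = grid.thread_rank(); i < N; i += grid.size())
            x[i] = x[i] * 0.99f + layer;
        grid.sync();  // cross-layer dependency, with no relaunch in between
    }
}

int main() {
    float* x;
    cudaMalloc(&x, N * sizeof(float));
    cudaMemset(x, 0, N * sizeof(float));
    int threads = 256;

    // Baseline: NUM_LAYERS separate launches.
    for (int layer = 0; layer < NUM_LAYERS; ++layer)
        layer_kernel<<<(N + threads - 1) / threads, threads>>>(x, layer);

    // Megakernel: one cooperative launch, sized so the entire grid is
    // resident at once, which grid.sync() requires.
    int num_sms, blocks_per_sm;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, megakernel,
                                                  threads, 0);
    void* args[] = {&x};
    cudaLaunchCooperativeKernel((void*)megakernel, num_sms * blocks_per_sm,
                                threads, args);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```

Note the block count for the cooperative launch is derived from the SM count rather than the problem size; the grid-stride loop inside the kernel then covers all of N with however many threads are resident.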

Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.

Can we automate this process through compilation? Motivated by this question, our team from CMU, UW, Berkeley, NVIDIA, and Tsinghua developed Mirage Persistent Kernel (MPK) — a compiler and runtime system that automatically transforms multi-GPU LLM inference into a high-performance megakernel. MPK unlocks the benefits of end-to-end GPU fusion while requiring minimal manual effort from developers.

Why MPK?

A key advantage of MPK is its extremely low LLM inference latency, achieved by eliminating kernel launch overhead and maximally overlapping computation, data loading, and inter-GPU communication across layers.
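The data-loading overlap is classic software pipelining. Below is a hedged sketch of the pattern, assuming per-layer weight tiles small enough to stage in shared memory (the TILE size and the elementwise "layer" are illustrative assumptions, not MPK's schedule): while a block computes on layer i's weights in one shared-memory buffer, an asynchronous copy fills the other buffer with layer i+1's weights.

```cuda
// pipeline_sketch.cu -- double-buffered weight staging across layers, the
// software-pipelining pattern a megakernel can apply within a single launch.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr int TILE = 1024;      // weights per layer seen by this block (assumed)
constexpr int NUM_LAYERS = 32;

__global__ void pipelined_layers(const float* __restrict__ weights,  // [NUM_LAYERS][TILE]
                                 float* __restrict__ activations) {  // [TILE]
    cg::thread_block block = cg::this_thread_block();
    __shared__ float buf[2][TILE];

    // Prologue: start fetching layer 0's weights.
    cg::memcpy_async(block, buf[0], weights, sizeof(float) * TILE);

    for (int layer = 0; layer < NUM_LAYERS; ++layer) {
        int cur = layer & 1;
        if (layer + 1 < NUM_LAYERS) {
            // Issue the next layer's load before computing the current one...
            cg::memcpy_async(block, buf[cur ^ 1],
                             weights + (layer + 1) * TILE, sizeof(float) * TILE);
            // ...then wait only for the current buffer; the next load stays in flight.
            cg::wait_prior<1>(block);
        } else {
            cg::wait(block);  // final layer: drain all outstanding copies
        }
        // Stand-in for the real layer computation.
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            activations[i] += buf[cur][i];
        block.sync();  // everyone is done reading buf[cur] before it is reused
    }
}

int main() {
    float *w, *act;
    cudaMalloc(&w, sizeof(float) * NUM_LAYERS * TILE);
    cudaMalloc(&act, sizeof(float) * TILE);
    cudaMemset(w, 0, sizeof(float) * NUM_LAYERS * TILE);
    cudaMemset(act, 0, sizeof(float) * TILE);
    pipelined_layers<<<1, 256>>>(w, act);  // single-block sketch
    cudaDeviceSynchronize();
    cudaFree(w); cudaFree(act);
    return 0;
}
```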

Figure 1. Comparing LLM decoding latency between MPK and existing systems. We used a 39-token prompt and generated 512 tokens without speculative decoding.

Figure 1 illustrates a performance comparison between MPK and existing LLM inference systems on both single- and multi-GPU configurations. On a single NVIDIA A100 40GB GPU, MPK reduces per-token decoding latency from 14.5 ms — as achieved by optimized systems like vLLM and SGLang — to 12.5 ms, approaching the theoretical lower bound of 10 ms (based on loading 16 GB of weights with 1.6 TB/s memory bandwidth).

Beyond single-GPU optimization, MPK fuses computation and inter-GPU communication into a single megakernel, which lets it overlap the two as aggressively as possible. As a result, MPK's performance improvements over current systems grow with the number of GPUs, making it particularly effective for multi-GPU deployments.
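One way such in-kernel overlap can be structured is sketched below, under the assumption that NVSHMEM's device-side API handles the communication (MPK's actual task-graph scheduler is considerably more sophisticated, and the chunk sizes and ring pattern here are illustrative assumptions): one thread block of the persistent kernel is specialized to push each finished chunk to the next GPU while the remaining blocks compute the following chunk, so communication hides behind compute.

```cuda
// overlap_sketch.cu -- block specialization inside one kernel: block 0 ships
// chunk c-1 to the next GPU while the other blocks compute chunk c.
// Illustrative only; launch via nvshmemx_collective_launch (a cooperative
// launch) so grid.sync() is legal, and run one process per GPU (e.g. nvshmemrun).
#include <cooperative_groups.h>
#include <nvshmem.h>
#include <nvshmemx.h>

namespace cg = cooperative_groups;

constexpr int CHUNK = 4096;
constexpr int NUM_CHUNKS = 8;

__global__ void compute_and_ship(float* buf) {  // buf: symmetric, [NUM_CHUNKS][CHUNK]
    cg::grid_group grid = cg::this_grid();
    int next_pe = (nvshmem_my_pe() + 1) % nvshmem_n_pes();

    for (int c = 0; c < NUM_CHUNKS; ++c) {
        if (blockIdx.x == 0) {
            if (c > 0)  // communication block: ship the chunk finished last iteration
                nvshmemx_float_put_block(buf + (c - 1) * CHUNK,
                                         buf + (c - 1) * CHUNK, CHUNK, next_pe);
        } else {
            // Compute blocks: produce chunk c (stand-in math).
            for (int i = (blockIdx.x - 1) * blockDim.x + threadIdx.x;
                 i < CHUNK; i += (gridDim.x - 1) * blockDim.x)
                buf[c * CHUNK + i] += 1.0f;
        }
        grid.sync();  // chunk c is ready; its put overlaps chunk c+1's compute
    }
    if (blockIdx.x == 0) {  // epilogue: ship the final chunk and flush
        nvshmemx_float_put_block(buf + (NUM_CHUNKS - 1) * CHUNK,
                                 buf + (NUM_CHUNKS - 1) * CHUNK, CHUNK, next_pe);
        if (threadIdx.x == 0) nvshmem_quiet();
    }
}

int main() {
    nvshmem_init();
    float* buf = (float*)nvshmem_malloc(sizeof(float) * NUM_CHUNKS * CHUNK);
    cudaMemset(buf, 0, sizeof(float) * NUM_CHUNKS * CHUNK);
    void* args[] = {&buf};
    nvshmemx_collective_launch((const void*)compute_and_ship,
                               dim3(8), dim3(256), args, 0, 0);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // make sure all puts have landed before freeing
    nvshmem_free(buf);
    nvshmem_finalize();
    return 0;
}
```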

What’s Next?

... continue reading