TLDR: We're releasing a throughput-optimized megakernel for tensor-parallel inference with Llama-70B on H100s. Our kernel can aggressively overlap compute, memory, and communication ops in order to simultaneously use the different hardware resources available on a GPU. When integrated into the Tokasaurus inference engine, our megakernel can outperform SGLang by >22% on end-to-end throughput (measured as time to finish 65,536 prompts from the ShareGPT benchmark). We're releasing the code here; please be warned that this really is research code; it is sensitive to compiler versions, GPU setup, and sometimes even being looked at the wrong way, and we have no intention whatsoever of supporting it. We hope you'll find the ideas and results interesting nonetheless!
Figure 1: Zoooommmm
A few months ago, we showed how we could fuse an entire model forward pass into a single "megakernel" in order to deliver low-latency inference with Llama-1B. In that post, we teased that many of the same concepts we introduced would also be useful for optimizing for throughput. We're now excited to bring receipts and release a new megakernel optimized for high-throughput inference with Llama-70B.
The inference workloads targeted by our low-latency and high-throughput megakernels are quite different and require distinct optimizations. Our low-latency megakernel targeted inference using Llama-1B when running on a single GPU with batch size one. This workload was entirely memory bound, and our focus was therefore on eliminating stalls that delayed loading model weights from global memory.
With large-batch Llama-70B inference, our workload is much more heterogeneous. Large portions of it (e.g. matrix multiplies, attention prefill) are compute-bound. Other parts (e.g. attention decode, RMS norm) are still bottlenecked by global memory bandwidth. Additionally, because we distribute the model across multiple GPUs, we now need to perform cross-GPU communication that puts pressure on the NVLink connections between devices. By running these components sequentially, we've paid for the whole GPU, but are only using little bits and pieces of it at a time. :(
Overall, these different operations in our model each make use of different resources available on the GPU (e.g. tensor cores, non-matmul compute units, HBM bandwidth, NVLink bandwidth) in unique ways. Therefore, a key area for optimizing this high-throughput workload is to overlap multiple kinds of work in order to simultaneously use more of the GPU's resources. We want to do this across many levels of the GPU -- within an individual SM, across multiple SMs, and even across GPUs.
Existing approaches to overlapping include assigning different SMs to different ops, developing custom kernels to run prefill and decode simultaneously, and running kernels in parallel with cross-GPU memory copying operations. Here, we show that the same simple, interpreter-based megakernel design we previously introduced can also achieve all of these fine-grained overlapping patterns -- and more! Most excitingly, despite the significant differences between our low-latency and high-throughput workloads, our core megakernel abstraction (a pipelined instruction interpreter that runs on each SM) is highly transferable across both domains.
In the rest of this blog, we will:
Give a brief recap of the design of our megakernels from our last, low-latency post.
Walk through the details of the tensor-parallel Llama forward pass that we map into our megakernel, including a novel approach to communicating intermediate results across GPUs right after running attention. This new operation requires a complicated multi-GPU transpose that isn't efficiently expressible with standard communication patterns, but is trivial to implement within the megakernel!
Show how megakernels can achieve fine-grained resource overlapping at multiple levels of the GPU hierarchy: within individual SMs, across multiple SMs, and across multiple GPUs!
Within individual SMs, the same inter-instruction pipelining we used in low-latency Llama can also help overlap memory movement and compute across instructions, thereby keeping the tensor cores running.
Across multiple SMs, careful scheduling of instructions can overlap both compute-intensive (e.g. matrix multiply) and memory-intensive (e.g. RMS norm) kinds of work at once, on an individual GPU.
Across GPUs, we can hide communication costs within special "storer" threads, leaving other threads free to work on the next instruction while communication happens in the background.
Finally, we put it all together by benchmarking our megakernel against vLLM and SGLang.
Megakernels: A Brief Recap
In our last post, we wrote our first full-model megakernel in order to optimize a low-latency scenario: running inference with Llama-3.2-1B and batch size one. We discovered that popular inference engines like vLLM and SGLang were only using about half of the available GPU bandwidth on an H100. The problem is that traditional systems break down model forward passes into dozens or hundreds of separate kernels, each with setup and teardown periods where no useful work gets done. These overhead periods create "memory pipeline bubbles" where an SM (i.e. a streaming multiprocessor, one of the compute subunits on a GPU) sits idle instead of loading model weights. Our solution to this was to merge the entire Llama-1B forward pass into a single fused "megakernel" that eliminates kernel boundaries altogether. We found that on small models, our megakernel could provide per-user throughput around 50% higher than inference frameworks like SGLang and vLLM.
The core abstraction behind our megakernel lies in an instruction-and-interpreter model.
Instructions: Rather than decomposing a model forward pass into a series of coarse-grained kernels, we decomposed it into a sequence of fine-grained instructions. Instructions can have distinct types, loosely corresponding to the kinds of kernels one would use in conventional implementations (e.g. matrix multiply, attention prefill, RMS norm). Each instruction specifies a unit of work that would traditionally be performed by a thread block, e.g. compute an output tile for a matrix multiplication. Furthermore, each instruction is organized into dedicated sections, e.g. a load function that reads from global memory, a compute function, and a store function that writes out results.
Interpreter: We execute these instructions using an on-GPU interpreter. When the megakernel launches, each SM initializes an interpreter and starts executing a sequence of instructions (these sequences are scheduled into per-SM queues ahead of time). A key feature of these interpreters is that they can aggressively pipeline across instruction boundaries, starting tasks for the next instruction (e.g. loading model weights) while the current instruction finishes. For our low-latency megakernel, this let us eliminate most of the memory bubbles between operations. For more details on the interpreter design (e.g. how we manage shared memory across instructions, how we synchronize different SMs), see the original blog post and the codebase.
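To make the pattern concrete, here is a minimal, hypothetical sketch of the instruction/interpreter idea. The struct layout, opcode names, and the `interpreter` kernel below are illustrative assumptions rather than the actual megakernel's API, and the real interpreter runs loader, consumer, and storer warps concurrently instead of executing each instruction's phases sequentially.

```cuda
// A minimal, hypothetical sketch of the instruction/interpreter pattern.
// Names and layouts are illustrative -- not the real megakernel's API.
struct Instruction {
    int opcode;     // which operation this instruction performs
    int args[15];   // operand offsets, tile coordinates, layer index, etc.
};

enum Opcode { NO_OP = 0, MATMUL_TILE = 1, ATTENTION_PARTIAL = 2, RMS_NORM = 3 };

// One thread block ("interpreter") runs per SM and walks its own
// pre-scheduled queue of instructions.
__global__ void interpreter(const Instruction* queues,   // [num_sms * queue_len]
                            int queue_len) {
    const Instruction* my_queue = queues + blockIdx.x * queue_len;
    for (int i = 0; i < queue_len; i++) {
        const Instruction inst = my_queue[i];
        switch (inst.opcode) {
            // Each case does the work a thread block would do in a
            // conventional kernel: load inputs into shared memory,
            // compute, and store results back to global memory.
            case MATMUL_TILE:        /* ... */ break;
            case ATTENTION_PARTIAL:  /* ... */ break;
            case RMS_NORM:           /* ... */ break;
            default:                 break;   // NO_OP
        }
        __syncthreads();
    }
}
```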
In this blog, we'll focus on a high-throughput workload with very different performance considerations than our previous Llama-1B target. However, as we'll describe below, this same core instruction/interpreter abstraction will be extremely helpful for achieving high throughput.
THE WORKLOAD
Anatomy of a Llama
First, we'll start with a brief walkthrough of the operations needed to perform a large-batch forward pass using tensor-parallel Llama-70B. Specifically, we implement the "sequence parallel" variant of TP, where some operations are performed data-parallel (i.e. each GPU holds full activation vectors for a slice of the tokens in a batch) and some operations are performed tensor-parallel (i.e. each GPU holds a slice of the activation vectors for all tokens). Concretely, with sequence parallelism each transformer block receives a data-parallel chunk of the hidden states (i.e. the full hidden states for a subset of tokens) as input and performs the following operations:
Data-parallel pre-attention RMS norm.
All-gather (i.e. each GPU collects the activations from all other GPUs, so that each GPU now has the activations from all tokens).
Tensor-parallel QKV projections, attention, and O projection (each GPU is responsible for a subset of attention heads).
Reduce-scatter.
Data-parallel post-attention residual connection and pre-MLP RMS norm.
All-gather again.
Tensor-parallel MLP.
Reduce-scatter.
Post-MLP residual connection.
However, we've made one important change to this formulation. One of our targets for operation overlapping is to overlap the O projection matrix multiplication with the subsequent reduce-scatter. However, the tensor-parallel sharded O matrix is too small for us to effectively hide the reduce-scatter communication cost. To solve this, we instead choose to replicate the O projection matrix across each GPU and run the O projection with data parallelism instead of with tensor parallelism. Alongside this change, we eliminate the post-attention reduce-scatter and replace it with a "distributed transpose" operation after attention that repartitions our data from a tensor-parallel configuration to a data-parallel configuration.
When using 8 GPUs, this reduces the network traffic by a factor of 8, which makes it much easier to hide the communication cost by overlapping it with matrix multiplications. Note that the downside of this approach is that it reduces the maximum batch size by about 15%, because replicating the O projection weights consumes an additional 9 GB of memory per GPU.
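As a rough back-of-the-envelope check on these numbers (assuming Llama-70B's hidden size of 8192, 80 transformer blocks, bf16 weights, and 8-way tensor parallelism):

$$
\underbrace{8192 \times 8192 \times 2~\text{bytes}}_{\text{O projection, one layer}} \times 80~\text{layers} \approx 10.7~\text{GB}, \qquad \tfrac{7}{8} \times 10.7~\text{GB} \approx 9.4~\text{GB of extra weights per GPU when replicated.}
$$

And for the traffic: a post-attention reduce-scatter would have each GPU send partial sums covering roughly 7/8 of the full activation tensor, whereas the distributed transpose only moves each GPU's 1/8 head slice, i.e. about 7/8 × 1/8 = 7/64 of the tensor per GPU -- the stated factor-of-8 reduction.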
Defining our Megakernel Instruction Set
With our parallelism scheme decided on, we are able to construct the instruction set for our high-throughput megakernel. We partition our workload into the following fused instructions:
RMS norm + all-gather.
QKV matrix multiply + RoPE.
Attention + distributed transpose.
O-projection matrix multiply + residual.
Gate matrix multiply + SiLU.
Up-projection matrix multiply + elementwise multiplication with the output of the gate.
Down matrix multiply + reduce-scatter + residual.
RMS norm without the all-gather (for the LM head).
LM head matrix multiply.
Relative to our latency-focused Llama-1B megakernel, this instruction set contains several high-level changes in our approach:
Most of our instructions for low-latency centered around matrix-vector multiplication, rather than the matrix-matrix multiplications we do here. The optimal work partitioning for these two operations is generally completely different. For matrix-vector products, each instruction computes several complete columns of the output vector. However, in matrix-matrix products, each instruction computes a tile of the output matrix instead.
To avoid extra trips to global memory, for our low-latency megakernel we frequently recomputed results across the GPU rather than communicating them through memory. This allowed us to fuse operations more aggressively than usual. For example, our QKV matrix-vector product fused the RMS norm into its beginning, saving an instruction boundary, but performing identical RMS computations over a hundred times. When focusing on throughput, this recomputation is not worth it.
Putting many tokens in our batch requires us to expand our inter-instruction signalling scheme [link to relevant section in llama-1b post] to track data dependencies across tokens. This signalling can vary by instruction -- for some operations (e.g. attention, RMS norms), synchronization is sometimes at the granularity of 128 output rows of a matrix; other times it's at the granularity of an individual attention head for an individual token.
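As a concrete illustration, here's one way such signalling can be implemented with atomic counters in global memory. The function names and granularity below are hypothetical simplifications of the megakernel's actual per-instruction schemes.

```cuda
// Hypothetical sketch of inter-instruction signalling through atomic
// counters in global memory (the real scheme varies per instruction type
// and granularity). A producer publishes a completed output tile; a
// consumer waits until all tiles it depends on have been published.

__device__ void signal_tile_done(unsigned int* counter) {
    __threadfence();        // make the tile's global-memory stores visible first
    atomicAdd(counter, 1u); // publish completion
}

__device__ void wait_for_tiles(const unsigned int* counter, unsigned int expected) {
    // One thread polls; the rest of the block waits at the barrier.
    if (threadIdx.x == 0) {
        while (*(volatile const unsigned int*)counter < expected) { /* spin */ }
    }
    __syncthreads();
}
```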
Overlapping Resources Within a Megakernel
Our primary goal when optimizing forward pass for throughput is to overlap hardware resources (e.g. use as much GPU memory bandwidth, compute units, and interconnect bandwidth as possible). Below, we show that our megakernel allows us to do this at three levels of hierarchy: within an SM by overlapping the stages of different instructions, across SMs by running different instructions, and across GPUs by overlapping communication with other work.
Overlapping within the SM
Within individual SMs, we make use of our megakernel template's instruction pipeline (previously described here) to overlap loading weights and activations for the next instruction with performing compute for the previous instruction. Even though our objective is now throughput, it's still useful to start loading the next instruction's data in advance -- to keep the matrix multiplies running as quickly as possible.
Within the SM, our interpreter specializes threads to different functions: each of the load, compute, and store functions within the instruction template is executed by its own, independent set of threads. This means that even while a compute or store is running, the loads for the next matrix multiplies can start as soon as possible.
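The sketch below illustrates the general idea of this warp specialization with a ring of shared-memory pages. It is a deliberately simplified, hypothetical version (one loader warp, one consumer warp, blocking loads, a toy reduction in place of real compute) rather than the interpreter's actual implementation, which uses asynchronous copies, dedicated storer warps, and pipelining across instruction boundaries.

```cuda
#define NUM_PAGES 4
#define PAGE_ELEMS 2048

// Simplified, hypothetical warp-specialization sketch: warp 0 streams tiles
// from global memory into a ring of shared-memory pages while warp 1
// consumes them, so loads for future tiles overlap with compute on current ones.
// Launch with at least 64 threads per block, e.g. <<<num_sms, 64>>>.
__global__ void specialized_block(const float* __restrict__ in,
                                  float* __restrict__ out, int num_tiles) {
    __shared__ float pages[NUM_PAGES][PAGE_ELEMS];
    __shared__ volatile int filled[NUM_PAGES];   // loader/consumer handshake flags

    int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
    if (threadIdx.x < NUM_PAGES) filled[threadIdx.x] = 0;
    __syncthreads();

    if (warp == 0) {                       // loader warp
        for (int t = 0; t < num_tiles; t++) {
            int p = t % NUM_PAGES;
            while (filled[p]) { }          // wait for the consumer to free the page
            for (int i = lane; i < PAGE_ELEMS; i += 32)
                pages[p][i] = in[(size_t)t * PAGE_ELEMS + i];
            __threadfence_block();         // order the writes before the flag
            __syncwarp();
            if (lane == 0) filled[p] = 1;  // hand the page to the consumer
        }
    } else if (warp == 1) {                // consumer warp
        for (int t = 0; t < num_tiles; t++) {
            int p = t % NUM_PAGES;
            while (!filled[p]) { }         // wait for the loader to fill the page
            __threadfence_block();
            float acc = 0.f;               // stand-in for real compute (e.g. a matmul tile)
            for (int i = lane; i < PAGE_ELEMS; i += 32) acc += pages[p][i];
            out[(size_t)t * 32 + lane] = acc;
            __syncwarp();
            if (lane == 0) filled[p] = 0;  // release the page back to the loader
        }
    }
}
```

A real instruction would replace the reduction with tensor-core matrix multiplies and add storer warps, but the handshake structure is the same: the loader can run several pages ahead of the consumer, which is exactly what keeps the compute units fed.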
To help understand this, we've written profiling tools that make it easier to see what's going on here. Figure 2, below, shows the brief transition between two different kinds of instructions on a single SM -- the Gate SiLU instruction (brown) and the Up matmul instruction (pink). If you'd like to build a better intuition for the profiler, you can access it here, alongside an example profile to download and play with.
Figure 2: A zoomed-in snapshot of a single SM across about 15 microseconds, as it transitions from a Gate SiLU instruction to an Up matmul instruction.
First, a quick tutorial on how to read what's going on in this zoomed-in profile snapshot:
The three horizontal tracks represent different kinds of threads within the interpreter that runs instructions. At the top are the loader threads, which pull data from global memory into shared memory. The thick band in the middle represents consumer threads that perform the main work -- in this case, running the tensor cores. The bottom row tracks storer threads, which store results from shared memory back up to global memory. Finally, in the background are controller threads that help coordinate the interpreter, although they're not programmed by the user.
Different colored bars represent different kinds of instructions. In this case, the brown bars on the left correspond to the Gate SiLU instruction, and the pink bars on the right correspond to the Up matmul instruction.
Thin vertical lines represent different kinds of events. For example, blue and cyan lines correspond to different kinds of loads being issued. Purple lines represent the beginning of a compute phase, and yellow and orange lines correspond to different kinds of stores. Finally, red lines represent wait events, and green lines represent ready states. In general, we have instructions only report events from the first 8 and last 4 stages, since we have limited timing slots.
Tall vertical lines in the background represent events happening within the controller warp of the interpreter. The salmon line tells the SM to fetch the next instruction, the pale green line indicates that the next instruction is set up, and the white line indicates the last instruction has finished and can now be torn down.
Here's a complete timeline of this little snapshot.
| Time | Threads | Action |
| --- | --- | --- |
| 4234.78 μs | Loader | Starts issuing loads for the last four stages of the matrix multiply pipeline (128 x 64 x 256). Dark blue lines = A matrix loads, cyan lines = B matrix loads (300 ns later). |
| + 0.51 μs | Loader | Signals controller to begin setting up the next instruction (after the fourth-to-last load). |
| + 2.53 μs | Consumer | Begins running the fourth-to-last matrix multiply (associated with the first load). |
| + 3.23 μs | Controller | Finishes setting up the next Up matmul instruction, entailing fetching and decoding the instruction, setting up semaphores, and remapping shared memory pages. |
| + 3.68 μs | Loader | Begins running the dependency check before loading inputs. |
| + 5.25 μs | Consumer | Finishes running matrix multiplies, begins storing results into two unreleased shared memory pages. |
| + 5.31 μs | Loader | Resolves the dependency check and issues loads for the first 2.5 stages, before stalling due to lack of available shared memory pages. |
| + 8.16 μs | Storer | Receives results, launches asynchronous store to global memory. Consumer threads start Up matmul matrix multiplies while shared memory is still in use. |
| + 8.64 μs | Consumer | Begins running matrix multiplies on the Up matmul instruction. |
| + 9.09 μs | Storer | First store finishes reading from shared memory, releases one page. Loader threads restart the load pipeline. |
| + 10.18 μs | Storer | Final store finishes reading from shared memory, releases the last page to the Up matmul instruction. The Up matrix multiply pipeline is completely unblocked for the rest of the instruction. |
| + 12.13 μs | Storer | Asynchronous stores to global memory complete. Atomically increments a flag in global memory signifying instruction completion. |
| + 12.38 μs | Controller | Notified that all threads have completed instruction work. Begins teardown, invalidating the previous instruction's semaphores and writing timing data to global memory. |
Now let's contrast this with a snapshot with the instruction pipeline disabled.
Figure 3: The exact same profile, but with no inter-instruction pipelining.
In this ablated profile, the store must finish, the instruction must be torn down, the next instruction set up, and memory loaded before matrix multiplies can begin again. With instruction pipelining enabled, the extra gap between consecutive matrix multiply stages is just 3.4 microseconds; without the pipeline, this gap jumps to 10.2 microseconds -- meaning this optimization alone reduces runtime by over 7% on these instructions.
Nor is this effect isolated to the boundary of these two particular instructions; it shows up everywhere. In Figure 4, we take a look at some zoomed out profiles showing all 8 GPUs, and we'll use similar profiles for other ablations in the rest of this post, so it's worth understanding this profile well.
Figure 4: Block profiles of pipelined versus serial instruction execution. Serial execution has a lot more gaps!
What we're looking at here represents a little over two full transformer blocks of a Llama-70B forward pass with a batch size of 8,192, across all 8 GPUs. Unlike the zoomed-in view, where each SM separates into three separate bars, each horizontal bar here just represents the activity of the consumers for that SM. And, as before, each different color represents a different kind of instruction. From left to right:
Blue represents the Attention RMS norm and all-gather.
Orange represents the QKV matrix multiply.
Green represents the attention and inter-GPU transpose.
Red represents the O-projection matrix multiply.
Purple represents the MLP RMS norm and all-gather.
Brown represents the Gate SiLU.
Pink represents the Up matrix multiply.
Grey represents the Down projection matrix multiply.
We also lightly shade the background to make it easier to distinguish SMs on different GPUs: if you look closely, you should see the light color bands separating the GPUs.
In the case of figure 4, the pipelined profile runs at 31,516 decoding TPS for this particular test workload (Prefill "tell me a funny joke about cookies", 30 decode tokens -- we have absolutely cornered the market on efficiently generating short cookie jokes), whereas the serial profile runs at just 29,607 TPS, corresponding to a difference of over 6%. This difference persists across different batch sizes, with pipelining generally providing around a 2-6% end-to-end throughput improvement, as shown in the table below.
| Batch size | Best config | Best config minus pipelining | % Difference |
| --- | --- | --- | --- |
| 1024 | 18,676 | 18,201 | 2.5% |
| 2048 | 26,388 | 24,641 | 6.6% |
| 4096 | 29,214 | 28,426 | 2.7% |
| 8192 | 31,516 | 29,607 | 6.1% |
Table 1: Decoding TPS Generating Short Cookie Jokes at Various Batch Sizes, with and without Pipelining
As a final note of this section, relating to the profiling itself, one might reasonably ask what overhead generating these plots incurs. It turns out to be very little: across 32 separate experiments we ran in the course of writing this post, we measured each runtime with and without generating timing data, each of which is a separate compilation path. We found the average difference to be just 0.39%, with a maximum of 1.07%. So, although timing may introduce a small amount of distortion, we think that this data is overall quite reliable. All TPS numbers are reported without timing recording enabled.
Overlapping across SMs
With our low-latency megakernel, each SM was assigned its own queue of instructions, scheduled in advance. For our high-throughput megakernel, we instead create a global work queue -- a single list of instructions that defines all the work that needs to run on the GPU. When an SM needs to fetch a new instruction to run, it atomically increments a global instruction counter that tracks the next instruction to be assigned. This approach is automatically robust to jitter in execution across different SMs; if one SM is slow to finish its instruction relative to the others, it will simply delay its request for new work, allowing other SMs to pick up the slack. This solution wasn't feasible for our low-latency megakernel, because the runtime of each instruction was so short that the latency of this atomic increment would be prohibitive. But with a throughput-oriented megakernel -- where individual instructions frequently take 100 microseconds or more -- this cost can be entirely hidden as part of our instruction pipeline.
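In code, the core of this dynamic scheduler is just an atomic fetch-and-increment on a counter in global memory. The sketch below is a hypothetical simplification (names are made up, and in the real megakernel the fetch is overlapped with the instruction pipeline rather than paid at every instruction boundary):

```cuda
// Hypothetical sketch of the global work queue. All SMs share a single
// instruction list plus a counter marking the next unclaimed instruction.
__device__ unsigned int next_instruction = 0;

// Returns the index of the instruction this SM should run next, or -1 if
// the queue is exhausted. One thread claims the work; the whole block uses it.
__device__ int fetch_next_instruction(unsigned int num_instructions) {
    __shared__ unsigned int claimed;
    if (threadIdx.x == 0)
        claimed = atomicAdd(&next_instruction, 1u);
    __syncthreads();
    return (claimed < num_instructions) ? (int)claimed : -1;
}
```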
Figure 5: Ablating the global work queue. On top, we use the global work queue. On the bottom, we use a simple round robin scheduler to assign work. The global work queue effectively smooths out variances in runtime that are present in the round-robin scheduler.
In figure 5, we ablate the global work queue by replacing it with a simple round-robin scheduler, and find a 14.2% end-to-end reduction in performance at a batch size of 8,192. A broader report is provided in table 2.
| Batch size | Best config | Best config without GWQ | % Difference |
| --- | --- | --- | --- |
| 1024 | 18,320 | 18,676 | -1.9% |
| 2048 | 26,388 | 25,518 | 3.3% |
| 4096 | 29,214 | 27,372 | 6.3% |
| 8192 | 31,516 | 27,033 | 14.2% |
Table 2: Decoding TPS Generating Short Cookie Jokes at Various Batch Sizes, with and without the Global Work Queue
As one can see from table 2, the global work queue becomes useful at large batch sizes, where there is enough work that jitter across SMs becomes important, and eliminating that jitter with dynamic scheduling makes a big difference. At very small batch sizes, the overhead of the global work queue actually outweighs its benefit.
Finally, it's worth noting that the global work queue is not the only way to improve scheduling over a naive round-robin scheduler; many other schedulers might work well. However, static schedulers cannot adapt to runtime jitter in the same way that the global work queue does; the variance in runtime across GPUs in figure 5 (despite them having nearly identical schedules) suggests that this jitter is a major factor.
Overlapping across GPUs
Networking Background
In order to implement our tensor-parallel Llama, we need to be able to exchange data between GPUs. In general, NVIDIA gives us two ways to do this. One common approach is to use the GPU's copy engines -- dedicated hardware for copying big, contiguous chunks of memory within or across devices. One advantage of using the copy engines is that these copies don't need to run within a kernel, freeing up the GPU's SMs to do other useful work! By using multiple CUDA streams, we can launch copy-engine operations that overlap with kernel computations (e.g. as is done in PyTorch's AsyncTP).
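For reference, here's roughly what that copy-engine approach looks like on the host side -- a minimal sketch with a placeholder kernel, buffer setup and error handling omitted. A peer copy issued on its own stream is serviced by a copy engine and can proceed while a kernel occupies the SMs on another stream.

```cuda
__global__ void dummy_compute(float* x, int n) {       // placeholder for real work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

void overlapped_step(float* local_buf, int n,
                     void* dst_on_peer, int peer_dev,
                     const void* src, int this_dev, size_t nbytes) {
    cudaStream_t compute_stream, comm_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&comm_stream);

    // Kernel keeps the SMs busy...
    dummy_compute<<<(n + 255) / 256, 256, 0, compute_stream>>>(local_buf, n);
    // ...while a copy engine moves data to the peer GPU in parallel.
    cudaMemcpyPeerAsync(dst_on_peer, peer_dev, src, this_dev, nbytes, comm_stream);

    cudaStreamSynchronize(compute_stream);
    cudaStreamSynchronize(comm_stream);
    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(comm_stream);
}
```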
The other way to transfer data between GPUs is to do so within a kernel, using the GPU's SMs to write data into remote memory on other GPUs via CUDA's unified memory architecture (thanks Bill!) We've added a corresponding new abstraction to ThunderKittens called the Parallel Global Layout (PGL). With PGLs, we perform asynchronous loads and stores directly to global memory on other devices, overlapping them with compute and local memory operations to achieve near-zero cost. We also leverage NVSwitch's central accelerator to offload collective operations to hardware outside the GPU. Read more about PGLs and our multi-GPU approach in our earlier blog post.
In our megakernel we use the second approach, because it gives us the control we need to perform all-gathers, reduce-scatters, and our post-attention distributed transpose (which allows us to do the O-projection in data-parallel form). We perform all communication from our dedicated storer threads, allowing loader and compute threads to move on to future work while inter-GPU communication proceeds in the background on the same SM.
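Here's a hypothetical sketch of what a storer-side remote write can look like (illustrative only, not the ThunderKittens PGL API): with peer access enabled, the storer warp writes results directly into a buffer allocated on another GPU and then raises a flag there, while the other warps on the SM move on to the next instruction.

```cuda
// Hypothetical storer-warp helper: write a finished tile from shared memory
// straight into a peer GPU's buffer over NVLink, then signal its arrival.
// (Assumes cudaDeviceEnablePeerAccess has been called between the devices.)
__device__ void store_tile_to_peer(float* __restrict__ remote_buf,   // lives on another GPU
                                   const float* __restrict__ smem_tile,
                                   int n,
                                   unsigned int* remote_flag) {      // also on the peer GPU
    int lane = threadIdx.x % 32;
    for (int i = lane; i < n; i += 32)
        remote_buf[i] = smem_tile[i];            // stores travel over NVLink
    __threadfence_system();                      // make them visible to the peer device
    __syncwarp();
    if (lane == 0)
        atomicAdd_system(remote_flag, 1u);       // tell the peer the tile has landed
}
```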
Interleaving
Warp specialization and instruction pipelining give us a way to overlap different hardware resources within an SM, while the global work queue keeps every SM supplied with work. However, we can benefit further from resource overlapping if we also assign different types of instructions to different SMs.
For example, with a large batch size, tokens at the beginning of the batch will have their compute-bound Down projections completed earlier than tokens at the end of the batch. This makes these early tokens ready to start the next instruction, which is the network-bound pre-attention RMS norm and all-gather. If we run that RMS norm and all-gather for these early tokens on some SMs, while computing the Down projection for the later tokens on different SMs, we can reduce peak network bandwidth and better exploit the hardware resources available on the device.
Some prior work, like NanoFlow, implemented this technique on A100s by constructing a schedule ahead-of-time that assigns SMs to different groups: compute-focused SMs that run matmul-heavy ops, memory-focused SMs that run attention decode, and comms-focused SMs that communicate results across devices. With our megakernel, we can perform this overlapping at a much finer granularity by interleaving instructions from different ops into our global work queue. Once we have scheduled enough Down projection instructions for some tokens to be ready for attention, we can start interleaving RMS norm instructions while we add the remaining Down projection instructions into our schedule (see the sketch below). This interleaving lets different SMs run different kinds of work without needing to explicitly assign SMs to groups like in NanoFlow.
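To make the interleaving concrete, here is a hypothetical host-side sketch of how such a schedule could be assembled. The chunking granularity and helper types are made up for illustration; the real scheduler tracks per-instruction dependencies rather than whole token chunks.

```cpp
#include <vector>

// Illustrative instruction record for the host-side scheduler sketch.
struct Instruction { int opcode; int token_chunk; };
enum { DOWN_PROJ_TILE = 0, RMS_NORM_ALLGATHER = 1 };

// Build a schedule where network-bound RMS norm + all-gather instructions
// for earlier token chunks are interleaved with compute-bound Down
// projection instructions for later chunks.
std::vector<Instruction> build_interleaved_schedule(int num_chunks, int tiles_per_chunk) {
    std::vector<Instruction> schedule;
    for (int chunk = 0; chunk < num_chunks; chunk++) {
        for (int t = 0; t < tiles_per_chunk; t++)
            schedule.push_back({DOWN_PROJ_TILE, chunk});          // compute-bound work
        if (chunk > 0)                                            // previous chunk is now ready
            schedule.push_back({RMS_NORM_ALLGATHER, chunk - 1});  // memory/NVLink-bound work
    }
    schedule.push_back({RMS_NORM_ALLGATHER, num_chunks - 1});
    return schedule;
}
```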
Figure 6: Ablating interleaving. On top, we use our standard interleaved schedule. On the bottom, we disable interleaving. Notice how the two RMS norms (purple and blue) are no longer quilted the same way, but instead take up 200-300 microseconds each time they come up. (Note that although QKV and Gate SiLU instructions may begin early, they rely on many RMS norms to satisfy their dependencies, and usually start after all of the RMS norm instructions are finished.)
The effect of the interleaving is easy to see in our profiles, since it creates "quilts" (i.e. dense multi-colored regions where different SMs are running different kinds of instructions at a given point in time).
| Batch size | Best config | Best config minus interleaving | % Difference |
| --- | --- | --- | --- |
| 1024 | 18,676 | 18,663 | 0.1% |
| 2048 | 26,388 | 26,388 | 0.0% |
| 4096 | 29,214 | 28,520 | 2.4% |
| 8192 | 31,516 | 29,492 | 6.4% |
Table 3: Decoding TPS Generating Short Cookie Jokes at Various Batch Sizes, with and without Interleaving
As one can see from table 3, interleaving (like the global work queue) also kicks in at large batch sizes, where there is enough work on the GPU to allow several waves of instructions of each type, and therefore effective interleaving of those waves. At these large batch sizes, it's just as important an optimization as intra-SM pipelining.
The key advantage of our approach is that it eliminates the overhead of additional kernel launches and cross-stream synchronization. We also note that we've removed our reliance on NCCL entirely, continuing in our march towards Obadiah Stane's paragon of minimal dependencies. Of course, there's been an astounding amount of work attempting to hide communication overhead alongside compute -- it was surprising to us how straightforward it was to overlap communication within the megakernel framework!
REEEEE(sults)
To evaluate our megakernel, we integrated it into Tokasaurus, which helps schedule batches of prefill and decode, alongside KV pages. This also allows us to schedule megakernel instructions on the CPU while the previous batch is running on the GPUs; with 64 threads generating instructions in C++, we generally find >90% CPU idle time.
For our benchmark, we sampled a set of 65,536 prompts + completion lengths from the ShareGPT dataset, and ran them through both SGLang and our megakernel. We ran SGLang using its recommended benchmarking settings; nonetheless, we recognize that results are sensitive to expert tuning. We report the input, output, and total throughputs in Table 4 (a more precise restatement of the results from Figure 1).
| System | Input Throughput (Tokens/s) | Output Throughput (Tokens/s) | Total Throughput (Tokens/s) |
| --- | --- | --- | --- |
| SGLang | 11,783 | 7,387 | 19,170 |
| Megakernel | 14,425 | 9,043 | 23,468 |

Table 4: Input, Output, and Total Throughput on the ShareGPT Benchmark (65,536 Prompts)
Despite these promising initial results, we suspect there's still considerable room for optimization within megakernels. Our scheduling heuristics are quite simple, our instructions frequently stall on synchronizations that could likely be hidden better, and our megakernel still has register spills and other low-level problems. All of these point towards this being an exciting direction for future work!
Conclusion: Megakernels are Cool
In this post, we introduced a tensor-parallel Llama-70B megakernel focused on maximizing decoding throughput. We designed a custom instruction set for this megakernel within our megakernel interpreter framework, and ablated several key scheduling decisions including pipelining across instruction boundaries, choosing processors for each instruction, and interleaving communication with compute. Finally, we integrated our megakernel into Tokasaurus, and found it outperformed SGLang by 22% on ShareGPT prompts.
A direction for future work: a key challenge of writing megakernels is that there is tremendous complexity both in designing these custom instruction sets and in coordinating (and especially debugging) synchronization patterns across GPUs. A corresponding lesson from this work is that, going forward, we'd like to design a more general megakernel instruction set and abstract these decisions into the host-side scheduler, in order to simplify the process of designing high-performance megakernels. We think this might make megakernels for training viable, alongside megakernels for inference.
*Work by Jordan done while at Stanford.