Anatomy of a high-performance EP kernel

Large language models are large. Because they’re large, we need lots of GPUs to run them. It would be nice if LLM inference were ‘embarrassingly parallel’ and we could just always compute independent things on different GPUs. But alas, to use lots of GPUs on LLM inference, we need to get those GPUs talking to one another.

There are lots of different ways to get different GPUs working together: Tensor Parallelism, Pipeline Parallelism, Context Parallelism, Expert Parallelism, etc. All have their place. But for MoE models, in the MoE layers, when you want to serve at large scale, ‘wide Expert Parallelism’ (wideEP) is king. (See vLLM’s original DeepSeek large-scale serving post for a demonstration at production scale: DeepSeek at 2.2k tokens/s per GPU on an H200 cluster, served with wideEP and data parallel attention.)

The other kinds of parallelism all require communication between GPUs, but their patterns are fixed by the architecture: who sends, who receives, and how much, are all known before the forward pass begins, and are the same on every step. The comms can run as standard collectives.

Expert parallelism is different. Which tokens need to reach which GPUs is decided by the router, from the data, at runtime, fresh in every MoE layer. And the tokens have somewhere to be reached from: we’ll assume the ‘data parallel attention’ arrangement DeepSeek serves with, where each token lives on exactly one rank (a rank being one GPU somewhere in our cluster). The experts are spread across those same ranks, so a token and the experts it’s routed to will generally not be in the same place. Here’s an example, with 8 GPUs split across 2 nodes, two experts per GPU, 1 token per rank, and 2 routed experts per token:

[Interactive figure: dispatch, experts, combine. GPUs 0 to 7 (ranks r0 to r7) are split across node 0 and node 1, with experts 0 to 15 placed two per rank. Traffic within a node uses NVLink; traffic crossing between nodes uses RDMA.]

Four of the sixteen experts drew no tokens at all this step: the routing is lumpy.

When it comes time to run our MoE layers, our tokens have to go and meet their experts, wherever they might be in the network fabric. It’s the job of the EP communication kernel to make that happen.
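To make ‘wherever they might be in the network fabric’ a little more tangible, here is a minimal sketch of the bookkeeping for a single token. It assumes the layout from the example (16 experts, 8 ranks, 2 nodes), and it further assumes that experts are placed contiguously (experts 0 and 1 on rank 0, and so on) and that each node holds 4 ranks. The function names are ours, made up for illustration; they are not taken from DeepEP or any other library.

```python
# Assumed layout, matching the example: 16 experts, 8 ranks, 2 nodes.
NUM_EXPERTS = 16
NUM_RANKS = 8
RANKS_PER_NODE = 4  # assumption: ranks 0-3 on node 0, ranks 4-7 on node 1
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS  # 2


def expert_to_rank(expert: int) -> int:
    """Assumes contiguous placement: experts 0-1 on rank 0, 2-3 on rank 1, ..."""
    return expert // EXPERTS_PER_RANK


def rank_to_node(rank: int) -> int:
    return rank // RANKS_PER_NODE


def link_type(src_rank: int, dst_rank: int) -> str:
    """Which part of the fabric a token crosses going from src to dst."""
    if src_rank == dst_rank:
        return "local"   # the expert lives on the token's own GPU
    if rank_to_node(src_rank) == rank_to_node(dst_rank):
        return "nvlink"  # same node, different GPU
    return "rdma"        # different node


def dispatch_hops(src_rank: int, routed_experts: list[int]) -> list[tuple[int, int, str]]:
    """For one token: (expert, destination rank, link) per routed expert."""
    hops = []
    for expert in routed_experts:
        dst = expert_to_rank(expert)
        hops.append((expert, dst, link_type(src_rank, dst)))
    return hops


if __name__ == "__main__":
    # A token on rank 0 routed to experts 3 and 12.
    for hop in dispatch_hops(0, [3, 12]):
        print(hop)
```

Combine is the same set of hops in reverse: each expert’s output travels back over the same link to the token’s home rank.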

The modern shape of these kernels was set by DeepSeek’s DeepEP library. In this post we’ll build up the anatomy of a DeepEP-style dispatch and combine kernel: the high-throughput shape first, then the low-latency one.

The job we have to do

Let’s make the setup concrete. We have 8 GPUs, split across 2 nodes, connected with RDMA, and each data parallel rank owns a single GPU. Attention runs on each GPU over a batch of $B_i$ tokens, where $B_i$ can vary between GPUs. We’re doing expert parallel with $E = 16$ experts, two per GPU, of which $K = 2$ are routed for each token.
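A small simulation shows why this traffic cannot be planned ahead of time. The sketch below uses the setup above ($E = 16$, $K = 2$, 8 ranks) with made-up batch sizes $B_i$ and a uniformly random stand-in for the router, then tallies how many tokens each rank has to deliver to every other rank. Two things here are assumptions rather than anything stated above: experts are placed contiguously, and a token is sent at most once to a given destination rank even when both of its experts live there.

```python
import random

NUM_RANKS, NUM_EXPERTS, TOP_K = 8, 16, 2
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS  # assumes contiguous placement


def fake_router(num_tokens: int, rng: random.Random) -> list[list[int]]:
    """Stand-in for the real router: K distinct experts per token, chosen uniformly."""
    return [rng.sample(range(NUM_EXPERTS), TOP_K) for _ in range(num_tokens)]


def send_counts(topk_per_rank: list[list[list[int]]]) -> list[list[int]]:
    """counts[src][dst] = number of tokens rank src must deliver to rank dst.

    Assumption: a token is counted once per destination rank, even if
    both of its routed experts live on that rank.
    """
    counts = [[0] * NUM_RANKS for _ in range(NUM_RANKS)]
    for src, topk in enumerate(topk_per_rank):
        for experts in topk:
            for dst in {e // EXPERTS_PER_RANK for e in experts}:
                counts[src][dst] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    batch_sizes = [3, 1, 4, 1, 5, 2, 6, 2]  # hypothetical B_i, one per rank
    topk_per_rank = [fake_router(b, rng) for b in batch_sizes]
    counts = send_counts(topk_per_rank)
    for src, row in enumerate(counts):
        print(f"rank {src} sends {row}")
    received = [sum(row[dst] for row in counts) for dst in range(NUM_RANKS)]
    print("received per rank:", received)
```

Changing the seed changes the whole matrix, and a real router redraws it in every MoE layer, so no rank knows in advance how much it will send or receive.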
