
vLLM Large-Scale Serving: DeepSeek at 2.2k tok/s/H200 with Wide-EP


Introduction

In v0.11.0, the last of the vLLM V0 engine code was removed, completing the migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s community of 1,969 contributors, who authored over 950 commits in the past month (as of 12/18/25).

These efforts have been validated by vLLM’s inclusion in the SemiAnalysis open source InferenceMax performance benchmarks. In addition, vLLM is proud to be trusted in production by teams at Meta, LinkedIn, Red Hat, Mistral, and HuggingFace.

DeepSeek-style disaggregated serving and sparse mixture-of-experts (MoE) model deployments remain state-of-the-art for high-performance LLM inference. This article outlines the key optimizations the vLLM team has built to push throughput even further (see the configuration sketch after this list), including:

Async scheduling

Dual-batch overlap

Disaggregated serving

CUDA graph mode FULL_AND_PIECEWISE

DeepGEMM enabled by default

DeepEP kernels integration
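
As a rough illustration of how several of these features are exposed, the sketch below uses vLLM's offline Python API to enable expert parallelism, async scheduling, and the FULL_AND_PIECEWISE CUDA graph mode, with DeepGEMM and DeepEP selected through environment variables. The specific argument and variable names (enable_expert_parallel, async_scheduling, VLLM_USE_DEEP_GEMM, VLLM_ALL2ALL_BACKEND) are assumptions based on recent vLLM releases rather than the exact configuration behind the 2.2k tok/s/H200 result, and they can change between versions.

# Minimal sketch, assuming a recent vLLM release; flag and environment
# variable names may differ in the version you run.
import os

# DeepGEMM kernels (reported as on by default in recent releases) and the
# DeepEP all-to-all backend for expert parallelism are selected via
# environment variables, set here before the engine is constructed.
os.environ.setdefault("VLLM_USE_DEEP_GEMM", "1")
os.environ.setdefault("VLLM_ALL2ALL_BACKEND", "deepep_low_latency")

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # illustrative target model
    tensor_parallel_size=8,
    enable_expert_parallel=True,       # wide expert parallelism (wide-EP)
    async_scheduling=True,             # overlap CPU scheduling with GPU execution
    compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
)

# Dual-batch overlap and prefill/decode disaggregation require additional
# cluster-level launch configuration that is not shown in this sketch.
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)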
