Introduction
In v0.11.0, the last of the code from the vLLM V0 engine was removed, completing the migration to the improved V1 engine architecture. This milestone would not have been possible without vLLM’s community of 1,969 contributors, who have authored over 950 commits in the past month (as of 12/18/25).
These efforts have been validated by vLLM’s inclusion in the SemiAnalysis open-source InferenceMax performance benchmarks. In addition, vLLM is proud to be trusted in production by teams at Meta, LinkedIn, Red Hat, Mistral, and HuggingFace.
DeepSeek-style disaggregated serving and sparse mixture-of-experts (MoE) model deployments remain state-of-the-art for high-performance LLM inference. This article outlines the key optimizations the vLLM team has built to push throughput even further, including:
Async scheduling
Dual-batch overlap
Disaggregated serving
CUDA graph mode FULL_AND_PIECEWISE
DeepGEMM enabled by default
DeepEP kernels integration
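As a quick orientation before the detailed sections, here is a minimal sketch of how several of these features can be enabled through vLLM's offline Python API. The model name, parallelism settings, and the async_scheduling, enable_expert_parallel, and compilation_config parameters are illustrative, based on recent vLLM releases rather than a recommended configuration, and exact names and defaults may differ between versions.

```python
# Minimal sketch (not the exact configuration from this post): enabling a few
# of the listed features via vLLM's offline Python API. Parameter names follow
# recent vLLM releases and may differ between versions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # illustrative sparse MoE model
    tensor_parallel_size=8,
    enable_expert_parallel=True,       # shard MoE experts across GPUs
    async_scheduling=True,             # overlap CPU scheduling with GPU execution
    # Capture both full and piecewise CUDA graphs (discussed later in the post).
    compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
)

outputs = llm.generate(
    ["Summarize disaggregated serving in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```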