We are thrilled to announce Day-0 support for DeepSeek-V4 across both inference and RL training. SGLang and Miles form the first open-source stack to serve and train DeepSeek-V4 on launch day — with systems purpose-built for its hybrid sparse-attention architecture, manifold-constrained hyper-connections (mHC), and FP4 expert weights.
Figure 1. Decode throughput of SGLang vs. the other OSS engine on a 30K-token prompt truncated from "Dream of the Red Chamber". We used the best-effort speculative-decoding configuration for each engine, following its official recipe. See the benchmark notes for details.
TL;DR
SGLang and Miles ship Day-0 inference and RL for DeepSeek-V4 (1.6T Pro, 284B Flash).
Inference (caching & attention): ShadowRadix prefix cache, HiSparse CPU-extended KV, MTP speculative decoding with in-graph metadata, Flash Compressor, Lightning TopK, hierarchical multi-stream overlap.
Inference (kernels & deployment): fast kernel integrations (FlashMLA, FlashInfer TRTLLM-Gen MoE, DeepGEMM Mega MoE, TileLang mHC), DP/TP/CP attention, EP MoE on DeepEP, PD disaggregation.
RL training: full parallelism (DP/TP/SP/EP/PP/CP), TileLang attention, enhanced stability, FP8 training.
Hardware: Hopper, Blackwell, Grace Blackwell, AMD, NPU.
Launch Commands: SGLang Cookbook
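For a quick sense of what serving looks like, here is a minimal offline-inference sketch using SGLang's Python Engine API. The model path, tensor-parallel size, and sampling parameters below are placeholder assumptions, not the official recipe; consult the SGLang Cookbook for the exact DeepSeek-V4 launch commands on your hardware.

```python
# Minimal sketch of offline inference with SGLang's Python Engine API.
# The model path and parallelism settings are assumptions for illustration;
# follow the SGLang Cookbook for the official DeepSeek-V4 configuration.
import sglang as sgl

if __name__ == "__main__":
    llm = sgl.Engine(
        model_path="deepseek-ai/DeepSeek-V4",  # hypothetical Hub path
        tp_size=8,                             # adjust to your GPU count
        trust_remote_code=True,
    )

    prompts = ["Summarize the plot of Dream of the Red Chamber in one sentence."]
    sampling_params = {"temperature": 0.6, "max_new_tokens": 128}

    for out in llm.generate(prompts, sampling_params):
        print(out["text"])

    llm.shutdown()
```

For online serving, `python -m sglang.launch_server` exposes the same model behind an OpenAI-compatible HTTP API; the Cookbook lists the per-hardware launch commands.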
Model Key Features & New Capabilities