
DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles

Why This Matters

Day-0 support for DeepSeek-V4 in the open-source SGLang and Miles stacks means the model can be served and trained the day it ships, rather than weeks later. Both systems target its hybrid sparse-attention architecture directly, delivering faster inference and more stable reinforcement-learning training, and making high-performance deployment of very large models accessible well beyond the labs that built them.

Key Takeaways

We are thrilled to announce Day-0 support for DeepSeek-V4 across both inference and RL training. SGLang and Miles form the first open-source stack to serve and train DeepSeek-V4 on launch day — with systems purpose-built for its hybrid sparse-attention architecture, manifold-constrained hyper-connections (mHC), and FP4 expert weights.

Figure 1. Decode throughput of SGLang vs. the other OSS engine on a 30K-token prompt truncated from "Dream of the Red Chamber". We used the best-effort speculative-decoding configuration for each engine, following its official recipe. See the benchmark notes for details.
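For a rough sense of how a decode-throughput number like this can be reproduced, the sketch below starts an SGLang server and drives it with the project's bundled benchmark client at a ~30K-token input length. This is an illustration under assumptions, not the authors' harness: the model path is a hypothetical placeholder, the prompts here are synthetic rather than the literary excerpt in the figure, and the flag names follow recent SGLang releases.

    # Hedged sketch, not the authors' benchmark setup.
    # Start the server (model path is a hypothetical placeholder).
    python -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V4 \
      --tp 8 \
      --trust-remote-code

    # In a second shell: measure serving throughput with ~30K-token
    # synthetic inputs (the figure uses a truncated literary prompt instead).
    python -m sglang.bench_serving \
      --backend sglang \
      --dataset-name random \
      --random-input-len 30000 \
      --random-output-len 512 \
      --num-prompts 8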

TL;DR

SGLang and Miles ship Day-0 inference and RL for DeepSeek-V4 (1.6T Pro, 284B Flash).

- Inference (caching & attention): ShadowRadix prefix cache, HiSparse CPU-extended KV, MTP speculative decoding with in-graph metadata, Flash Compressor, Lightning TopK, hierarchical multi-stream overlap.
- Inference (kernels & deployment): fast kernel integrations (FlashMLA, FlashInfer TRTLLM-Gen MoE, DeepGEMM Mega MoE, TileLang mHC), DP/TP/CP attention, EP MoE on DeepEP, PD disaggregation (a hedged sketch follows this list).
- RL training: full parallelism (DP/TP/SP/EP/PP/CP), TileLang attention, enhanced stability, FP8 training.
- Hardware: Hopper, Blackwell, Grace Blackwell, AMD, NPU.
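On the PD-disaggregation item above: prefill and decode run as separate server pools behind a lightweight router, so compute-bound prefill and latency-bound decode can be scaled independently. A minimal sketch, assuming the disaggregation flags and mini load balancer found in recent SGLang releases; module and flag names may differ in the official V4 recipe, and the model path is a hypothetical placeholder.

    # Hedged sketch of PD disaggregation; flag and module names assume
    # recent SGLang releases and may differ for the official V4 recipe.

    # Prefill pool
    python -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V4 \
      --tp 8 --trust-remote-code \
      --disaggregation-mode prefill \
      --port 30001

    # Decode pool
    python -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V4 \
      --tp 8 --trust-remote-code \
      --disaggregation-mode decode \
      --port 30002

    # Mini load balancer routing each request through prefill, then decode
    python -m sglang.srt.disaggregation.mini_lb \
      --prefill http://127.0.0.1:30001 \
      --decode http://127.0.0.1:30002 \
      --host 0.0.0.0 --port 8000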

Launch Commands: SGLang Cookbook
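As a flavor of what the cookbook covers, here is a minimal launch sketch. It is hedged, not official: the model path is a hypothetical placeholder, and the data-parallel-attention and speculative-decoding flags follow the names recent SGLang releases use for DeepSeek-V3-class models, so the actual V4 recipe may differ.

    # Hedged sketch; consult the SGLang Cookbook for the official V4 recipe.
    # Model path is a hypothetical placeholder.
    python -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V4 \
      --tp 8 \
      --enable-dp-attention \
      --dp-size 8 \
      --speculative-algorithm EAGLE \
      --speculative-num-steps 3 \
      --speculative-eagle-topk 1 \
      --speculative-num-draft-tokens 4 \
      --trust-remote-code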

Model Key Features & New Capabilities
