
KVarN: Native vLLM KV-cache quantization backend by Huawei

Why This Matters

KVarN from Huawei is a native vLLM attention backend that quantizes the KV cache, providing 3-5× more KV-cache capacity and up to ~1.3× FP16 throughput while maintaining FP16-level accuracy. Its plug-and-play design simplifies integration and makes room for longer contexts and more concurrent requests, both of which matter for serving large language models in production.

Key Takeaways

⚡️ Built for agentic and long-context workloads.

💡 KVarN delivers 3-5× more KV-cache capacity and up to ~1.3× the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy.

🔌 Calibration-free and plug-and-play with vLLM. KVarN is a native vLLM attention backend: add one flag, with no model changes.

🥊 Up to ~2.4× the throughput of TurboQuant at the same capacity, with higher accuracy.

Why KVarN (Variance Normalized KV-Cache)?

kvarn /kvɑːɳ/ · noun (Swedish) A grinding apparatus used to reduce substances into smaller particles or powder, especially grains, seeds, spices, coffee beans, KV-caches.

KV-cache quantization usually comes with a catch. As the vLLM TurboQuant blog shows, existing methods buy extra KV-cache capacity but give up throughput (TurboQuant reports 40 to 52% lower throughput in exchange for 2.3-3.7× capacity), and aggressive low-bit quantization also tends to cost accuracy. Losing both speed and quality is the main reason KV-cache quantization is rarely turned on in production.

KVarN is built to keep both. On Qwen3-32B (AIME25, 16K-context burst, TP=2) it matches FP16 accuracy and beats FP16 throughput while delivering ~4× the KV-cache capacity.

Compared on accuracy and throughput, KVarN stays in the upper-right corner that the methods covered in the TurboQuant blog can't reach: FP16-level accuracy, FP16-or-better throughput, and several times the context.
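To make the capacity figure concrete, here is a rough sizing sketch showing how a ~4× KV-cache capacity gain could translate into concurrent 16K-token requests. Everything in it is an assumption for illustration: the model shape (64 layers, 8 KV heads, head dimension 128) is an assumed Qwen3-32B-like configuration, the 20 GiB cache budget is hypothetical, and the ~4× gain is treated as an exact factor. It does not model how KVarN stores the cache and is not taken from the project's benchmarks.

```python
# Back-of-the-envelope KV-cache sizing. All model and memory numbers below are
# illustrative assumptions, not measurements from the KVarN benchmarks.

def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: float) -> float:
    """Bytes needed to cache one token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(cache_budget_bytes: float, bytes_per_token: float,
                             context_len: int) -> int:
    """Number of full-length sequences that fit in the cache budget."""
    tokens_that_fit = cache_budget_bytes // bytes_per_token
    return int(tokens_that_fit // context_len)


# Assumed Qwen3-32B-like shape: 64 layers, 8 KV heads, head_dim 128, FP16 values.
fp16_per_token = kv_bytes_per_token(64, 8, 128, bytes_per_value=2)

# Hypothetical budget: 20 GiB of GPU memory left for the KV cache in total.
budget = 20 * 1024**3
context_len = 16 * 1024  # 16K-token requests

capacity_gain = 4  # the ~4x figure quoted above, treated as exact here
fp16_seqs = max_concurrent_sequences(budget, fp16_per_token, context_len)
quant_seqs = max_concurrent_sequences(
    budget, fp16_per_token / capacity_gain, context_len
)

print(f"FP16 KV cache: {fp16_per_token / 1024:.0f} KiB per token")
print(f"16K-token sequences that fit: FP16={fp16_seqs}, ~4x capacity={quant_seqs}")
```

Under these assumptions the same memory budget holds roughly four times as many full-length sequences, which is where the "more concurrent requests" claim comes from; real deployments will differ with block allocation overhead, tensor parallelism, and the actual compression ratio.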

Quickstart
