Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon
Published on: 2025-07-05 00:04:58
KVSplit: Differentiated KV Cache Quantization for Apple Silicon
Overview
Run larger context windows and heavier LLMs on your Mac by applying different quantization precision to keys vs values in the attention mechanism's KV cache. KVSplit enables you to:
Reduce memory usage by up to 72% with minimal quality loss
Run 2-3x longer contexts in the same memory budget
Maintain or improve inference speed compared to FP16
Optimize for Apple Silicon with full Metal support
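The core idea is that keys and values in the KV cache tolerate quantization differently, so each can be stored at its own precision. A minimal sketch of that idea in plain Python, not KVSplit's actual Metal kernels: symmetric block quantization applied at two different bit widths. The `quantize`/`dequantize` helpers are hypothetical illustrations.

```python
def quantize(xs, bits):
    """Symmetric quantization of a block of floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(x) for x in xs) / qmax  # one scale per block
    return [round(x / scale) for x in xs], scale

def dequantize(q, scale):
    return [v * scale for v in q]

keys = [0.81, -0.33, 0.05, -0.92]           # toy attention-key block
vals = [0.40, 0.12, -0.77, 0.28]            # toy attention-value block

k8, ks = quantize(keys, 8)                  # keys kept at higher precision
v4, vs = quantize(vals, 4)                  # values compressed harder

k_err = max(abs(a - b) for a, b in zip(keys, dequantize(k8, ks)))
v_err = max(abs(a - b) for a, b in zip(vals, dequantize(v4, vs)))
print(f"8-bit key error: {k_err:.4f}, 4-bit value error: {v_err:.4f}")
```

The 4-bit block shows a much larger round-trip error than the 8-bit one, which is why the K8V4 and K4V8 rows below diverge so sharply in perplexity despite identical memory use.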
Key Findings
| Configuration  | VRAM @ 8K tokens | Tokens/sec | Perplexity change |
|----------------|------------------|------------|-------------------|
| FP16 (base)    | 176.00 MB (100%) | 54,360     | --                |
| K8V8 (8-bit)   | 93.50 MB (47%)   | 51,503     | +0.03%            |
| K8V4           | 71.50 MB (41%)   | 57,438     | +0.86%            |
| K4V8           | 71.50 MB (41%)   | 58,690     | +6.06%            |
| K4V4 (4-bit)   | 49.50 MB (28%)   | 55,193     | +6.15%            |
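The VRAM column is consistent with llama.cpp-style block quantization formats, where q8_0 packs a 32-element block into 34 bytes (8.5 bits/element) and q4_0 into 18 bytes (4.5 bits/element), with keys and values each occupying half of the FP16 cache. A quick sanity check, assuming those layouts (the constants below come from the standard GGML block formats, not from KVSplit's own code):

```python
FP16_BITS = 16.0
Q8_0_BITS = 34 * 8 / 32   # 8.5: block of 32 int8s plus one fp16 scale
Q4_0_BITS = 18 * 8 / 32   # 4.5: block of 32 nibbles plus one fp16 scale

def kv_cache_mb(fp16_mb, key_bits, value_bits):
    """Project KV-cache size when keys and values use different bit widths."""
    half = fp16_mb / 2                      # keys and values are equal halves
    return half * key_bits / FP16_BITS + half * value_bits / FP16_BITS

for name, kb, vb in [("K8V8", Q8_0_BITS, Q8_0_BITS),
                     ("K8V4", Q8_0_BITS, Q4_0_BITS),
                     ("K4V8", Q4_0_BITS, Q8_0_BITS),
                     ("K4V4", Q4_0_BITS, Q4_0_BITS)]:
    print(f"{name}: {kv_cache_mb(176.0, kb, vb):.2f} MB")
```

This reproduces the 93.50, 71.50, and 49.50 MB figures from the table above, and explains why K8V4 and K4V8 land on identical memory footprints.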
Memory Savings by Sequence Length
| Configuration   | 128 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
|-----------------|------------|-------------|-------------|-------------|
| FP16 (baseline) | 5.50 MB    | 44.00 MB    | 88.00 MB    | 176.00 MB   |
| K8V8 (8-bit)    | 2.92 MB    | 23.38 MB    | 46.75 MB    | …           |