Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon
Published on: 2025-07-05 00:04:58
KVSplit: Differentiated KV Cache Quantization for Apple Silicon
Overview
Run larger context windows and heavier LLMs on your Mac by applying different quantization precision to keys vs values in the attention mechanism's KV cache. KVSplit enables you to:
Reduce memory usage by up to 72% with minimal quality loss
Run 2-3x longer contexts in the same memory budget
Maintain or improve inference speed compared to FP16
Optimize for Apple Silicon with full Metal support
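The core idea is that keys and values in the KV cache tolerate quantization differently, so each can be stored at its own precision. A minimal sketch of that idea in plain Python, not KVSplit's actual Metal kernels: symmetric block quantization applied at two different bit widths. The `quantize`/`dequantize` helpers are hypothetical illustrations.

```python
def quantize(xs, bits):
    """Symmetric quantization of a block of floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(x) for x in xs) / qmax  # one scale per block
    return [round(x / scale) for x in xs], scale

def dequantize(q, scale):
    return [v * scale for v in q]

keys = [0.81, -0.33, 0.05, -0.92]           # toy attention-key block
vals = [0.40, 0.12, -0.77, 0.28]            # toy attention-value block

k8, ks = quantize(keys, 8)                  # keys kept at higher precision
v4, vs = quantize(vals, 4)                  # values compressed harder

k_err = max(abs(a - b) for a, b in zip(keys, dequantize(k8, ks)))
v_err = max(abs(a - b) for a, b in zip(vals, dequantize(v4, vs)))
print(f"8-bit key error: {k_err:.4f}, 4-bit value error: {v_err:.4f}")
```

The 4-bit block shows a much larger round-trip error than the 8-bit one, which is why the K8V4 and K4V8 rows below diverge so sharply in perplexity despite identical memory use.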
Key Findings
| Configuration  | VRAM @ 8K tokens | Tokens/sec | Perplexity change |
|----------------|------------------|------------|-------------------|
| FP16 (base)    | 176.00 MB (100%) | 54,360     | --                |
| K8V8 (8-bit)   | 93.50 MB (47%)   | 51,503     | +0.03%            |
| K8V4           | 71.50 MB (41%)   | 57,438     | +0.86%            |
| K4V8           | 71.50 MB (41%)   | 58,690     | +6.06%            |
| K4V4 (4-bit)   | 49.50 MB (28%)   | 55,193     | +6.15%            |
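The VRAM column is consistent with llama.cpp-style block quantization formats, where q8_0 packs a 32-element block into 34 bytes (8.5 bits/element) and q4_0 into 18 bytes (4.5 bits/element), with keys and values each occupying half of the FP16 cache. A quick sanity check, assuming those layouts (the constants below come from the standard GGML block formats, not from KVSplit's own code):

```python
FP16_BITS = 16.0
Q8_0_BITS = 34 * 8 / 32   # 8.5: block of 32 int8s plus one fp16 scale
Q4_0_BITS = 18 * 8 / 32   # 4.5: block of 32 nibbles plus one fp16 scale

def kv_cache_mb(fp16_mb, key_bits, value_bits):
    """Project KV-cache size when keys and values use different bit widths."""
    half = fp16_mb / 2                      # keys and values are equal halves
    return half * key_bits / FP16_BITS + half * value_bits / FP16_BITS

for name, kb, vb in [("K8V8", Q8_0_BITS, Q8_0_BITS),
                     ("K8V4", Q8_0_BITS, Q4_0_BITS),
                     ("K4V8", Q4_0_BITS, Q8_0_BITS),
                     ("K4V4", Q4_0_BITS, Q4_0_BITS)]:
    print(f"{name}: {kv_cache_mb(176.0, kb, vb):.2f} MB")
```

This reproduces the 93.50, 71.50, and 49.50 MB figures from the table above, and explains why K8V4 and K4V8 land on identical memory footprints.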
Memory Savings by Sequence Length
| Configuration   | 128 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
|-----------------|------------|-------------|-------------|-------------|
| FP16 (baseline) | 5.50 MB    | 44.00 MB    | 88.00 MB    | 176.00 MB   |
| K8V8 (8-bit)    | 2.92 MB    | 23.38 MB    | 46.75 MB    | …           |