Autoregressive next token prediction and KV Cache in transformers
(news.ycombinator.com)
1.
2.
KV Cache Is Becoming the Memory Hierarchy of Inference
(news.ycombinator.com)
3.
High-Fidelity KV Cache Summarization Using Entropy and Low-Rank Reconstruction
(news.ycombinator.com)
4.
From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
(news.ycombinator.com)
5.
6.
Nvidia says it can shrink LLM memory 20x without changing model weights
(venturebeat.com)