As Large Language Models (LLMs) move toward million-token context windows, we are hitting a physical limit: the KV Cache. Storing the Keys and Values for every single token in a sequence causes VRAM usage to scale linearly with sequence length.
In the standard Transformer architecture, each new token requires access to the keys and values of all preceding tokens. As a result, the KV cache grows linearly with sequence length: $\mathcal{O}(N)$. For long-context models, this quickly exceeds the VRAM capacity of a single GPU (e.g., an H100), necessitating either distributed sharding or aggressive pruning strategies.
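To make the linear growth concrete, here is a back-of-envelope calculation of KV cache size. The model configuration below (80 layers, 8 KV heads with GQA, head dimension 128, fp16) is a hypothetical example chosen for illustration, not a claim about any specific model:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in bytes: one K and one V vector per token, per layer,
    per KV head; each vector has head_dim elements."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 70B-class config: 80 layers, 8 KV heads, head_dim 128, fp16.
size = kv_cache_bytes(1_000_000, 80, 8, 128, 2)
print(f"{size / 2**30:.0f} GiB")  # ~305 GiB for a 1M-token context
```

At roughly 305 GiB for a single million-token sequence, the cache alone dwarfs the 80 GB of an H100, which is exactly why sharding or pruning becomes unavoidable.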
Existing Heavy Hitter or Top-K eviction strategies rely on a simple premise: if a token isn’t being looked at now, it won’t be looked at later. However, information in natural language is inherently context-dependent and non-stationary. A token that is irrelevant in one segment may become the primary anchor in another.
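A typical Heavy Hitter / Top-K scheme scores each key by the attention mass it has accumulated so far and evicts everything outside the top k. The sketch below is a minimal illustration of that premise (not any specific paper's implementation); note how a token with low accumulated score is dropped regardless of whether a future query might need it:

```python
import numpy as np

def topk_keep_indices(attn_weights, k):
    """attn_weights: (num_queries, num_keys) softmax attention matrix.
    Score each key by total attention received, keep the k heaviest."""
    importance = attn_weights.sum(axis=0)   # accumulated mass per key
    keep = np.argsort(importance)[-k:]      # indices of the k "heavy hitters"
    return np.sort(keep)

# Key 2 dominates past attention, so keys 0 and 1 are evicted --
# even if a later query would have anchored on them.
attn = np.array([[0.1, 0.1, 0.8],
                 [0.2, 0.1, 0.7]])
print(topk_keep_indices(attn, 1))  # [2]
```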
In this post, I'll walk through a different paradigm: the SRC (Selection-Reconstruction-Compression) pipeline. Instead of deleting tokens, we mathematically summarize them using tools from information theory and linear algebra.
To understand why the pruning assumption fails, we first need to examine the structure of attention itself.
Why Pruning Fails: Attention is Structured
The attention mechanism does not operate on tokens independently. Instead, it produces a dense interaction pattern between queries and keys, where each token participates in a global computation.
Each row corresponds to a query token, each column to a key token, and the intensity reflects how strongly a query attends to a given key.
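The attention matrix being described can be computed directly. This is a minimal sketch of standard scaled dot-product attention weights, where each row (query) is a probability distribution over the columns (keys):

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention: rows index queries, columns index keys.
    Each row sums to 1, forming a distribution over the keys."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(5, 64)), rng.normal(size=(5, 64))
A = attention_weights(Q, K)  # shape (5, 5); plot with plt.imshow(A) to see the structure
```

Visualizing `A` as a heatmap (e.g., with `matplotlib.pyplot.imshow`) produces exactly the kind of dense query-key interaction pattern discussed above.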
Several observations emerge: