Tech News

From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem

Why This Matters

This article highlights the importance of the KV cache in large language models, explaining how it enables efficient conversation processing by storing key-value pairs in GPU memory. As models grow larger and more complex, optimizing cache efficiency becomes crucial for reducing hardware costs and energy consumption and for improving scalability for both developers and consumers. Advances in reducing the memory footprint per token are vital for making AI more accessible and sustainable in the tech industry.

Key Takeaways

How the KV cache gives every AI conversation a physical weight in silicon, and what happens when the memory runs out.

This one leans more technical than our usual Sci-Fi Saturday fare. Stick with it; we get to the sci-fi by the end.

What KV Cache Actually Is

Someone types a forty-three-character question into ChatGPT, some throwaway query about dinner recipes or the capital of Mongolia. Before the first word of the response appears, those characters have been split into tokens, each token multiplied through billions of parameters to produce three vectors: a query, a key, and a value. The key-value pairs land in GPU memory, where they sit physically as bytes on a chip. That stored state is the model's awareness of the conversation, not a metaphor but a memory address.

The key-value cache exists for a practical reason. Without it, generating each new token would require reprocessing every previous token in the conversation from scratch. A 2,000-token exchange would mean re-reading the entire history 2,000 times. The KV cache eliminates that redundancy. Once a token's key-value pair is computed and stored, it stays. The next token only needs to attend to what's already cached. Computation drops from quadratic to linear.
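The decode loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the projections are random stand-ins for what a real transformer would compute from token embeddings, and only a single attention head is shown. The point is the append-once behavior, where each token's key and value enter the cache exactly once and every later token simply attends to what is already stored.

```python
import numpy as np

def attention_step(q, k_cache, v_cache):
    """Attend the newest token's query to every cached key/value pair."""
    scores = k_cache @ q / np.sqrt(q.shape[-1])  # one dot product per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over cached positions
    return weights @ v_cache                     # weighted sum of cached values

d = 8  # toy head dimension
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))

rng = np.random.default_rng(0)
for step in range(4):
    # Stand-in projections; a real model derives q, k, v from the token embedding.
    q, k, v = rng.standard_normal((3, d))
    # Append this token's key/value once -- this is the cached state that lets
    # later tokens skip reprocessing the whole history.
    k_cache = np.vstack([k_cache, k])
    v_cache = np.vstack([v_cache, v])
    out = attention_step(q, k_cache, v_cache)

print(k_cache.shape)  # one cached key per generated token so far
```

Without the two `vstack` calls, each step would have to recompute keys and values for the entire prefix, which is where the quadratic cost comes from.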

Sebastian Raschka's LLM Architecture Gallery visualizes this mechanism across dozens of model families, and the numbers attached to each architecture make the weight tangible. GPT-2's KV cache costs 300 KiB per token in Raschka's comparison. That means a 4,000-token conversation occupies roughly 1.2 GB of GPU memory just for the cache, separate from the model weights themselves. Micron's engineering blog describes the KV cache as the point where "buzzword meets bottom line," and they're right. Every conversation has a physical cost measured in bytes, in watts, in cooling costs, in dollars per hour of GPU rental.
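The 1.2 GB figure follows directly from the per-token cost. A quick sanity check of the arithmetic, using the 300 KiB/token number from the article:

```python
kib_per_token = 300       # GPT-2's KV cache cost per token (from Raschka's comparison)
tokens = 4000             # length of the conversation

total_bytes = kib_per_token * 1024 * tokens
print(f"{total_bytes / 1e9:.2f} GB")  # ~1.23 GB, matching the "roughly 1.2 GB" above
```

And that is cache alone: the model weights, activations, and any other concurrent sessions all compete for the same GPU memory.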

Your conversation weighs something and takes up space, and when the session ends, that space gets reclaimed and everything stored there vanishes.

How Memory Evolved

The way models handle their KV cache has changed four times in six years, and each change says something about what the designers thought was worth remembering. The numbers below come from Raschka's architecture comparisons.

GPT-2 (2019) used multi-head attention in its simplest form. Every attention head maintained its own independent set of keys and values. The cost: 300 KiB per token. Every head remembered everything in its own way, with no sharing and no shortcuts. As Raschka details in Build a Large Language Model (From Scratch), this was the straightforward design. Attention heads and memory were both cheap, so the design just remembered everything.
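As a rough reconstruction of where 300 KiB per token comes from: each transformer layer stores one key vector and one value vector per token, and across all heads those vectors concatenate back to the model width. Assuming GPT-2 XL's published dimensions (48 layers, model width 1600) and 16-bit values, which is a guess at the precision Raschka's comparison assumes, the numbers line up exactly:

```python
def kv_bytes_per_token(n_layers, d_model, bytes_per_elem=2):
    # 2 = one key vector plus one value vector per layer, per token.
    # Per-head sizes sum to d_model, so heads don't appear separately here.
    return 2 * n_layers * d_model * bytes_per_elem

gpt2_xl = kv_bytes_per_token(n_layers=48, d_model=1600)
print(gpt2_xl / 1024)  # 300.0 KiB per token
```

Later architectures attack each factor in that product: fewer key-value heads shrink the effective width, and lower-precision caches shrink the bytes per element.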
