Last week I was writing about the hardware side of the AI memory problem: the HBM density penalty, the EUV bottleneck, and the supply chain pressure squeezing DRAM prices for everyone from data centre operators down to consumer electronics. This week, Google published something that attacks the exact same problem using another approach: not “build more memory”, but “need less of it.”
You guessed it! This post dives a bit deeper into what TurboQuant is and what it may imply for the field of AI. What Pied Piper achieved in the TV show Silicon Valley with their general-purpose lossless compression algorithm, Google may have achieved for compressing information represented as vectors in a high-dimensional space.
What is a transformer? And the KV cache?
But before getting into what TurboQuant does, let’s make a brief detour to understand what this algorithm is actually built to compress, and why that matters for LLMs and the memory problem.
GPT models are what are known as autoregressive: they generate text one token at a time, where each new token is conditioned on everything that came before. You send a prompt, the model reads all of it, picks the most likely next word, appends it, reads everything again, picks the next word, and so on. One token at a time, left to right, until it decides to stop.
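The loop described above can be sketched in a few lines. This is a toy, not a real model: `next_token` stands in for a full forward pass, and its "echo the sequence length" rule is purely illustrative.

```python
def next_token(tokens):
    # A real model would run attention over the whole sequence and sample
    # from a probability distribution; this toy just returns the current
    # sequence length as the "most likely" next token.
    return len(tokens)

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        # Each step re-reads everything generated so far, appends one
        # token, and repeats: one token at a time, left to right.
        tokens.append(next_token(tokens))
    return tokens

print(generate([10, 20, 30], 3))  # [10, 20, 30, 3, 4, 5]
```

The important structural point is that `next_token` receives the *entire* sequence on every iteration, which is exactly where the cost discussed below comes from.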
The core mechanism that lets the model read everything at each step is called attention. For every token in the sequence, the model computes three vectors: a query, a key, and a value. You can think of them as a richer version of a key-value store. To generate the next token, the model compares the current query against every previous key, essentially asking “which past tokens are relevant right now?”, and uses the answer to weigh the corresponding values and build up context.
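In code, that comparison is a dot product between the query and each key, turned into weights with a softmax, and used to blend the values. Here is a minimal single-query sketch in NumPy (the random vectors stand in for learned projections of real tokens):

```python
import numpy as np

def attention(query, keys, values):
    # Compare the current query against every previous key, scaled by
    # sqrt(d) to keep the scores numerically well-behaved.
    scores = keys @ query / np.sqrt(query.shape[-1])
    # Softmax turns raw scores into weights that sum to 1:
    # "which past tokens are relevant right now?"
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Blend the corresponding values into a context vector.
    return weights @ values

d = 4                                  # toy embedding dimension
rng = np.random.default_rng(0)
q = rng.standard_normal(d)             # query for the current token
K = rng.standard_normal((3, d))        # keys for three previous tokens
V = rng.standard_normal((3, d))        # values for those tokens
context = attention(q, K, V)
print(context.shape)  # (4,)
```

Production implementations batch this over many queries, heads, and layers at once, but the per-token logic is exactly this.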
This is implemented (as you may all know by now) through the transformer architecture. Transformer layers are responsible for encoding the input sequences into a meaningful representation, applying the attention mechanism, and decoding into an output representation. All LLMs are architectural variations of this basic cell.
To get a sense of each of these variations I highly recommend Sebastian Raschka’s LLM Architecture gallery: from GPT-2 to DeepSeek and GLM.
Without caching, the keys and values for every previous token are recomputed from scratch on every single pass through the architecture. If your conversation is N tokens long and you’re generating token N+1, the model recalculates the N sets of keys and values it already calculated on the previous step. This is slow and wasteful.
The obvious fix is to cache them. The keys and values are computed once per token and stored so they can be looked up in subsequent steps instead of being recalculated (the query only matters for the token currently being generated, so it doesn’t need to be stored). This is the KV cache: a running store of the key and value vectors for all previous tokens, kept in GPU memory so they are readily accessible when needed.
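The caching pattern looks roughly like this. A sketch under simplifying assumptions: `W_k` and `W_v` play the role of the learned key/value projection matrices of a single layer, and each call to `step` processes one new token embedding:

```python
import numpy as np

d = 4                                  # toy embedding dimension
rng = np.random.default_rng(1)
W_k = rng.standard_normal((d, d))      # stand-ins for learned key/value
W_v = rng.standard_normal((d, d))      # projection matrices

k_cache, v_cache = [], []              # the KV cache: one entry per token

def step(x_new):
    """Process one new token embedding, reusing all cached keys/values."""
    # Only the NEW token's key and value are computed on this step...
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    # ...yet attention still sees keys/values for the whole sequence.
    return np.stack(k_cache), np.stack(v_cache)

for _ in range(5):
    K, V = step(rng.standard_normal(d))
print(K.shape, V.shape)  # (5, 4) (5, 4)
```

The trade this makes is exactly the one TurboQuant targets: compute per step drops from O(N) key/value projections to O(1), but the cache itself grows linearly with sequence length, layer count, and head count, and it all has to live in GPU memory.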