Contextualization Machines
March 10, 2025
Introduction
This post is meant to be an illustration of my mental model of a transformer, a sort of synthesis of a bunch of thoughts and ideas I’ve had over the past few months. It assumes familiarity with the transformer architecture.
I see a lot of people say transformers are next-token predictors. And while that’s true for how LLMs and other models may work, I feel like it still doesn’t give you a good mental model of how the transformer actually operates. Over time, I’ve developed a mental model that helps me make sense of transformer behavior: I view them fundamentally as contextualization machines. After all, next-token prediction is a learning objective, not an architecture. That said, I work mainly with LLMs, so this post will still focus heavily on them (decoder-only architectures and all that). As we go through the post, we’ll examine each component of a transformer through the lens of contextualization, and I’ll illustrate the mental model by showing how it helps frame various research results and papers.
When I talk about contextualization, I mean contextualization of tokens and hidden states. One view of the decoder-only transformer that I find useful is to think of the residual chain as the main backbone of the model and the layers as additive transformations, rather than thinking of the flow of states through the layers as the backbone with the residuals as anchoring states or something along those lines. Here’s a diagram to illustrate what I mean.
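In code form, the same picture looks roughly like the sketch below: a minimal pre-norm decoder block where `attn` and `mlp` are stand-ins for whatever sublayers a given model uses. This is not any particular model’s implementation, just the shape of the computation: the residual stream flows through untouched except for the additive updates each sublayer writes back onto it.

```python
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Sketch of a pre-norm decoder block: the residual stream is the backbone,
    and each sublayer only *adds* a contextualization term onto it."""

    def __init__(self, attn: nn.Module, mlp: nn.Module, d_model: int):
        super().__init__()
        self.attn = attn                  # placeholder self-attention sublayer
        self.mlp = mlp                    # placeholder feed-forward sublayer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h):
        # h is the residual stream; each line below reads some context,
        # computes a contextualization, and adds it back onto h.
        h = h + self.attn(self.norm1(h))  # contextualize via attention, add back
        h = h + self.mlp(self.norm2(h))   # contextualize via the MLP, add back
        return h
```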
In a sense, each layer’s transformation of the hidden states can be viewed as a contextualization operation on the embedding, and that contextualization is then added back onto the token representation. If you graph out correlations (as cosine similarity) of hidden states between layers, you’ll see that the hidden states after each layer are pretty similar to the hidden states before the layer, with a small difference (what I would call the extra contextualization). Here’s a graph of cosine similarities between different layers’ hidden states in Llama-3.2-1B.
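If you want to check this yourself, here’s a rough sketch using Hugging Face transformers and PyTorch. The prompt is arbitrary and the sketch only compares consecutive layers (rather than the full layer-by-layer matrix), so the exact numbers will differ from my plot, but the pattern of high similarity with small per-layer differences should show up the same way.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # any decoder-only model works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one tensor per layer,
# each of shape (batch, seq_len, hidden_dim)
hidden_states = outputs.hidden_states

for i in range(len(hidden_states) - 1):
    before, after = hidden_states[i][0], hidden_states[i + 1][0]
    # per-token cosine similarity between the states before and after layer i,
    # averaged over the sequence
    sim = F.cosine_similarity(before, after, dim=-1).mean().item()
    print(f"layer {i:2d} -> {i + 1:2d}: cosine similarity {sim:.3f}")
```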
So, imagining a contextualization operation as something that enriches a token embedding or hidden state with more information, let’s see how I frame the transformer.