
Understanding Transformers Using a Minimal Example


The internal mechanisms of Transformer-based large language models (LLMs), particularly the flow of information through the layers and the operation of the attention mechanism, can be challenging to follow because of the sheer volume of numbers involved, which makes it hard to form a mental model. This article aims to make these workings tangible by providing visualizations of a Transformer's internal state. Using a minimal dataset and a deliberately simplified model, it is possible to follow the model's internal processes step by step. One can observe how information is transformed across different layers and how the attention mechanism weighs different input tokens. This approach offers a transparent view into the core operations of a Transformer.

After training for 10,000 steps, the model achieves low loss on both the training data and the validation sentence. Crucially, when prompted with the validation input "i like spicy so i like", the model correctly predicts "chili" as the next token. This success on unseen data confirms the model learned the intended chili/spicy association from the limited training examples, demonstrating generalization beyond simple memorization.
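As a rough sketch, this validation check amounts to greedy next-token prediction on the held-out prompt. The snippet below assumes a PyTorch model and a word-level vocabulary mapping; all names are illustrative, not the article's actual implementation.

import torch

def predict_next_token(model, token_to_id, id_to_token, prompt):
    # Map the prompt's words to token ids using the word-level vocabulary.
    ids = torch.tensor([[token_to_id[w] for w in prompt.split()]])
    with torch.no_grad():
        logits = model(ids)                  # shape: (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax().item()  # greedy choice at the last position
    return id_to_token[next_id]

# After training, the expectation is:
# predict_next_token(model, token_to_id, id_to_token, "i like spicy so i like") == "chili"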

The Transformer model itself is a decoder-only model, drastically scaled down compared to typical LLMs. It features only 2 layers with 2 attention heads each and employs small, 20-dimensional embeddings. Furthermore, it uses tied word embeddings (the same matrix for input lookup and output prediction, an approach also used in Google's Gemma), which reduces the parameter count and places input and output representations in the same vector space, a property that is helpful for visualization. The result is a model with roughly 10,000 parameters, vastly smaller than typical LLMs with their billions or trillions of parameters. This extreme simplification makes the internal computations tractable and visualizable.
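A minimal sketch of such an architecture in PyTorch is shown below. The hyperparameters follow the text; everything else (class names, the feed-forward width, the use of PyTorch's built-in encoder layer with a causal mask to obtain decoder-only behavior) is an assumption for illustration.

import torch
import torch.nn as nn

VOCAB_SIZE = 19   # word-level vocabulary (see tokenization below)
D_MODEL    = 20   # embedding dimension
N_LAYERS   = 2
N_HEADS    = 2

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        block = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS,
            dim_feedforward=4 * D_MODEL,  # assumed feed-forward width
            batch_first=True)
        self.layers = nn.TransformerEncoder(block, num_layers=N_LAYERS)

    def forward(self, ids):
        x = self.embed(ids)
        # Causal mask: each position may only attend to itself and earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        x = self.layers(x, mask=mask)
        # Tied embeddings: reuse the input embedding matrix as the output projection.
        return x @ self.embed.weight.T  # logits over the vocabulary

model = TinyDecoder()
print(sum(p.numel() for p in model.parameters()))  # on the order of 10,000 parameters

For brevity the sketch omits positional information, which a real implementation would add to the token embeddings.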

Tokenization is kept rudimentary. Instead of complex subword methods like Byte Pair Encoding (BPE), a simple regex splits text primarily into words. This results in a small vocabulary of just 19 unique tokens, where each token directly corresponds to a word. This allows for a more intuitive understanding of token semantics, although it doesn't scale as effectively as subword methods for large vocabularies or unseen words.
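A sketch of this word-level tokenization, with an assumed regex and a toy corpus standing in for the article's actual training text:

import re

def tokenize(text):
    # Lower-case and keep only runs of letters, so every word becomes one token.
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize("i like spicy so i like chili")
vocab = sorted(set(tokens))            # built from the full training text in practice
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}
# Applied to the article's full corpus, this yields 19 unique tokens, one per word.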

A single, distinct sentence is held out as a validation set. This sentence tests whether the model has truly learned the semantic link between "chili" and "spicy" (which appear together in the training data only in different phrasings) or whether it has merely memorized the training sequences.

The training data is a highly structured, minimal dataset focused on simple relationships between a few concepts: fruits and tastes. Unlike vast text corpora, this dataset features repetitive patterns and clear semantic links, making it easier to observe how the model learns specific connections.

This article employs a strategy of radical simplification across three key components: the training data, the tokenization method, and the model architecture. While significantly scaled down, this setup allows for detailed tracking and visualization of internal states. Fundamental mechanisms observed here are expected to mirror those in larger models.

Visualizing the Internals

While Transformer implementations operate on multi-dimensional tensors for efficiency, handling batches of sequences and processing entire context windows in parallel, the conceptual picture can be simplified. At its core, every token is represented by a one-dimensional embedding vector, and the internal representation derived from that embedding remains a one-dimensional vector throughout the process. This property can be exploited for visualization.
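One way to extract these per-token vectors for plotting, assuming a PyTorch model like the sketch above (the hook wiring and module names are assumptions):

import torch

def collect_hidden_states(model, ids):
    # Capture the output of every Transformer block with forward hooks.
    captured = []
    hooks = [block.register_forward_hook(
                 lambda module, inputs, output, store=captured: store.append(output.detach()))
             for block in model.layers.layers]  # the stacked blocks inside nn.TransformerEncoder
    with torch.no_grad():
        model(ids)
    for h in hooks:
        h.remove()
    # Each captured tensor has shape (1, seq_len, 20): one 20-dimensional vector
    # per token, which can be rendered directly as a row of cells in a heatmap.
    return captured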

Token Embeddings
