How LLMs Actually Work
This post is a walkthrough of how LLMs work. Modern LLMs are mostly built by stacking transformer blocks over and over, so understanding the transformer machinery gets you most of the way there.
I’ll cover the core mechanisms inside modern transformer-based LLMs, without all that sticky math stuff. Don’t get me wrong, you should learn the math, but this can serve as an introduction.
Most modern LLMs share the same transformer-family skeleton. The differences come from what each one was trained on, the scale and configuration choices, and the post-training done on top. By the end, you should be able to read many modern LLM papers or model cards and know which piece of the architecture each section is talking about.
Here’s the path:
Tokens, how a string of text becomes a sequence of integers
Embeddings, how those integers get meaning
Positional encoding, how the model knows what order the tokens came in
Attention, how tokens share information with each other
Multi-head attention, how the model tracks many kinds of relationships at once
The feed-forward network, where a large share of the model’s stored structure lives
The residual stream and layer normalization, what makes deep stacks trainable
Predicting the next token, what the model actually outputs and how the generation loop works
Architecture vs trained weights, what’s broadly shared across modern LLMs, and what’s different
Tiny explainers appear throughout so anyone can follow along, regardless of background.
Tokenization
Models don’t read text directly. They read integer IDs, so something has to convert your prompt into a sequence of those integers before the model ever sees it.
That conversion step is called tokenization. A tokenizer takes a string and produces a sequence of integers, where each integer points to an entry in a fixed vocabulary. Modern LLM vocabularies usually contain tens of thousands to a few hundred thousand entries.
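To make the string-to-integers idea concrete, here is a minimal sketch of a toy tokenizer. Everything in it is an assumption for illustration: the twelve-entry vocabulary, the `<unk>` fallback token, and the greedy longest-match rule are all made up for this example. Production tokenizers typically learn their vocabularies from data with subword schemes such as byte-pair encoding, and they hold far more entries, but the input/output contract is the same: text in, a list of vocabulary indices out.

```python
# Toy vocabulary (hypothetical). A token's ID is simply its position in this list.
VOCAB = ["<unk>", " ", "a", "at", "c", "cat", "e", "h", "s", "sat", "t", "the"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}
MAX_TOKEN_LEN = max(len(token) for token in VOCAB)


def encode(text):
    """Turn a string into a list of integer IDs using greedy longest-match."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate piece first, then shrink it one character at a time.
        for length in range(min(MAX_TOKEN_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos += length
                break
        else:
            # No vocabulary entry matches here: emit <unk> and skip one character.
            ids.append(TOKEN_TO_ID["<unk>"])
            pos += 1
    return ids


def decode(ids):
    """Map IDs back to their vocabulary entries and join them into a string."""
    return "".join(VOCAB[i] for i in ids)


print(encode("the cat sat"))          # [11, 1, 5, 1, 9]
print(encode("hats"))                 # [7, 3, 8]
print(decode(encode("the cat sat")))  # the cat sat
```

Notice that "hats" is not in the vocabulary, yet it still encodes cleanly as the pieces "h", "at", and "s". That is the basic appeal of subword vocabularies: a fixed list of entries can still cover words it has never seen whole.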