Large language models (LLMs) still feel a bit like magic to me. Of course, I understand the general machinery enough to know that they aren’t, but the gap between my outdated knowledge of the field and the state-of-the-art feels especially large right now. Things are moving fast. So six months ago, I decided to close that gap just a little by digging into what I believed was one of the core primitives underpinning LLMs: the attention mechanism in neural networks.
I started by reading one of the landmark papers in the literature, which was published by Google Brain in 2017 under the catchy title Attention Is All You Need (Vaswani et al., 2017). As the title suggests, the authors did not invent the attention mechanism. Rather, they introduced a neural network architecture which was in some sense “all attention”. This architecture is the now-famous transformer. Clearly the transformer stands in contrast to whatever came before it, but what was that, and what did the transformer do differently?
To answer these questions, I read a lot of papers, and the context that felt natural to provide here grew the more I read. I went down the rabbit hole, and when I came out, I realized that what had started as a study of attention had grown into a bigger story. Attention is still the throughline, but there are other important themes, such as how neural networks generalize and the bitter lesson that simple methods that scale seem to triumph over clever methods that do not. This post is the product of that deep dive: a stylized history of LLMs.
As a caveat, real life is endlessly detailed, and any summary or synthesis inevitably flattens this detail. So I will, accidentally or intentionally, skip over many important and related papers and ideas in the service of that synthesis. I will also skip over practicalities such as data preprocessing and advances in hardware and computing. My focus will be on what I view as the main methodological landmarks, and this history is simply one of many ways to tell the story.
Distributed representations
I’ll start with an old idea, one so ubiquitous today that it might seem silly to belabor here. The idea is that neural networks automatically generalize using distributed representations. This idea has its roots in computational neuroscience, particularly connectionism (McCulloch & Pitts, 1943), and was discussed explicitly in the 1980s in papers like Learning representations by back-propagating errors (Rumelhart et al., 1986) and Learning distributed representations of concepts (Hinton, 1986). Understanding it is key to understanding why LLMs work at all, and thus to understanding the long line of academic research that led to them.
But first, a problem. The goal of natural language processing (NLP) is to model human language using computers. Until the 1980s, NLP systems were mostly based on handwritten rules and handcrafted features. However, by the early 1990s, researchers were exploring the use of statistical methods from machine learning. For an early and seminal example, see A statistical approach to machine translation (Brown et al., 1990).
The core idea of statistical NLP is to model human language using a statistical language model, which is a probability distribution over all possible sequences in a language. This distribution is typically factorized such that each word depends on all words that precede it:
$$
p(w_{1:T}) = \prod_{t=1}^T p\left(w_t \mid w_{1:t-1}\right). \tag{1}
$$
Throughout this post, I will use the notation $w_{i:j}$ to denote the elements of a sequence from positions $i$ to $j$ inclusive (where $i \leq j$).
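To make the factorization in Equation (1) concrete, here is a minimal sketch in Python. It scores a sequence with the chain rule above, estimating each conditional with a toy bigram (Markov) model, a classic simplification that conditions only on the previous word rather than the full history. The corpus, function names, and numbers are my own illustration, not taken from any particular paper.

```python
import math
from collections import defaultdict

# Toy corpus; in practice this would be a large text collection.
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

# Count bigrams to estimate p(w_t | w_{t-1}), a Markov simplification
# of the full conditional p(w_t | w_{1:t-1}) in Equation (1).
bigram_counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for prev, curr in zip(["<s>"] + sentence, sentence):
        bigram_counts[prev][curr] += 1

def conditional(word, context):
    """Estimate p(word | context); the bigram model only looks at the last word."""
    prev = context[-1] if context else "<s>"
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

def sequence_log_prob(words):
    """log p(w_{1:T}) = sum_t log p(w_t | w_{1:t-1}), per the chain-rule factorization."""
    log_prob = 0.0
    for t, word in enumerate(words):
        p = conditional(word, words[:t])
        if p == 0.0:
            return float("-inf")  # unseen event; a real system would smooth
        log_prob += math.log(p)
    return log_prob

print(sequence_log_prob(["the", "cat", "sat"]))
```

Running this prints the log-probability of “the cat sat” under the toy model, $\log(1 \cdot \tfrac{2}{3} \cdot \tfrac{1}{2}) \approx -1.10$.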