Large language models (LLMs) still feel a bit like magic to me. Of course, I understand the general machinery enough to know that they aren't, but the gap between my outdated knowledge of the field and the state-of-the-art feels especially large right now. Things are moving fast. So six months ago, I decided to close that gap just a little by digging into what I believed was one of the core primitives underpinning LLMs: the attention mechanism in neural networks.

I started by reading one of the landmark papers in the literature, which was published by Google Brain in 2017 under the catchy title Attention is all you need (Vaswani et al., 2017). As the title suggests, the authors did not invent the attention mechanism. Rather, they introduced a neural network architecture which was in some sense "all attention". This architecture is the now-famous transformer. Clearly the transformer stands in contrast to whatever came before it, but what was that, and what did the transformer do differently?

To answer these questions, I read a lot of papers, and the context that felt natural to provide here grew the more that I read. I went down the rabbit hole, and when I came out, I realized that what had started as a study of attention had grown into a bigger story. Attention is still the throughline, but there are other important themes, such as how neural networks generalize and the bitter lesson that simple methods that scale seem to triumph over clever methods that do not. This post is the product of that deep dive, and it is a stylized history of LLMs.

As a caveat, real life is endlessly detailed, and any summary or synthesis inevitably flattens this detail. So I will accidentally or intentionally skip over many important and related papers and ideas in the service of a synthesis. I will also skip over practicalities such as data preprocessing and advances in hardware and computing. My focus will be on what I view as the main methodological landmarks, and this history is simply one of many ways to tell this story.

Distributed representations

I'll start with an old idea, one so ubiquitous today that it might seem silly to belabor here. The idea is that neural networks automatically generalize using distributed representations. This idea has its roots in computational neuroscience, particularly connectionism (McCulloch & Pitts, 1943), and was discussed explicitly in the 1980s in papers like Learning representations by back-propagating errors (Rumelhart et al., 1986) and Learning distributed representations of concepts (Hinton, 1986). Understanding it is key to understanding why LLMs work at all and thus understanding the long line of academic research driving towards them.

But first, a problem. The goal of natural language processing (NLP) is to model human language using computers. Until the 1980s, NLP systems were mostly based on handwritten rules and handcrafted features. However, by the early 1990s, researchers were exploring the use of statistical methods from machine learning. For an early and seminal example, see A statistical approach to machine translation (Brown et al., 1990). The core idea of statistical NLP is to model human language using a statistical language model, which is a probability distribution over all possible sequences in a language. This distribution is typically factorized such that each word depends on all words that precede it:

$$
p(w_{1:T}) = \prod_{t=1}^T p\left(w_t \mid w_{1:t-1}\right). \tag{1}
$$

Throughout this post, I will use the notation $w_{i:j}$ to denote elements in a sequence from positions $i$ to $j$ inclusive (where $i \leq j$):

$$
w_{i:j} := \{w_i, w_{i+1}, \dots, w_{j-1}, w_j\}. \tag{2}
$$

Given a good statistical model $p(w_{1:T})$, we can do many things. For example, we can rank the likelihood of different sequences of words and use that ranking to decide on things like a conversational agent's output. Or we can translate a source sequence $s_{1:T}$ into a target sequence $w_{1:T}$ if we have the conditional probabilities between the two:

$$
p(w_{1:T} \mid s_{1:T}) \propto p(s_{1:T} \mid w_{1:T})\, p(w_{1:T}). \tag{3}
$$

Here, $p(w_{1:T})$ would be our language model of the target language, and $p(s_{1:T} \mid w_{1:T})$ would be our translation model.

Today, this view is so pervasive that it might feel obvious, but with a little imagination, I think it's easy to see how wrong this might have felt to a linguist forty-odd years ago. Equation $1$ captures no language structure or parts of speech such as nouns or verbs or adjectives—see e.g. (Chomsky, 1956) on formal grammars. Instead, it reduces the complexity of human language to next-word prediction. If we didn't know already that this worked, we might doubt that it would.

More importantly for us, estimating the model in Equation $1$ is hard! The main challenge is the curse of dimensionality. There are many, many words in a vocabulary. For example, linguists estimate that English has roughly a million words, give or take a few hundred thousand depending on how you count them. Furthermore, this problem explodes in some tasks such as translation, where there are many possible conditional probabilities $p(s_{1:T} \mid w_{1:T})$. So when estimating the conditional probabilities of our language model, we cannot possibly encounter all possible combinations. We have a data sparsity problem, and estimating the true probabilities becomes impossible.

Perhaps the oldest idea to tackle this problem was proposed in Andrey Markov's pioneering mathematical analysis of Pushkin's Eugene Onegin (Markov, 1913). He made the assumption that each conditional probability in Equation $1$ only depends on the previous $N$ terms:

$$
p(w_{1:T}) = \prod_{t=1}^T p\left(w_t \mid w_{1:t-1}\right) \approx \prod_{t=1}^T p\left(w_t \mid w_{t-N:t-1}\right). \tag{4}
$$

Today, we would call this a "Markov assumption", and Equation $4$ is the famous $N$-gram model. Particularly for small $N$, say $N=1$ or $N=2$, we might be able to get reasonable estimates from data. But here is the problem, and this problem is a central theme driving towards the attention mechanism: the Markov assumption destroys context. Without more context, a language model can never replicate the complexity and nuance of natural language.

As I understand it, this was conceptually the state of the field circa 2000. But then in 2003, a seminal paper was published: A neural probabilistic language model (Bengio et al., 2003).
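Before turning to that paper, here is a minimal sketch of the count-based approach it was reacting against: a bigram ($N=2$) model estimated from raw counts. The toy corpus and function names are mine, purely for illustration; the point is only that unseen word pairs get probability zero, which is the data sparsity problem in miniature.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate p(w_t | w_{t-1}) by maximum likelihood (raw counts)."""
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[prev][curr] += 1
    return {
        prev: {w: c / sum(counter.values()) for w, c in counter.items()}
        for prev, counter in bigram_counts.items()
    }

corpus = [
    "the cat is walking on the sidewalk",
    "the dog sat on the mat",
]
model = train_bigram_model(corpus)

print(model["the"])                    # p(. | "the"), estimated from counts
# An unseen bigram gets probability zero, even if it is perfectly sensible:
print(model["the"].get("dog", 0.0))    # "the dog" was seen   -> nonzero
print(model["the"].get("horse", 0.0))  # "the horse" was not  -> 0.0
```

No matter how much text we count over, most sensible word combinations will never appear, and a pure counting model has no way to generalize to them.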
In that paper, the authors proposed a novel idea: to avoid this data sparsity problem, this curse of dimensionality, we can use neural networks to learn a language model using what they call "distributed representations" of words. (Today, we might call these "word embeddings".) They proposed three core ideas. First, they represented each word as a real-valued vector or embedding; then, they expressed Equation $1$ in terms of these embeddings; and finally, they trained a neural network to simultaneously learn the embeddings and the parameters of the probability function (neural network) in Equation $1$ using back-propagation (Rumelhart et al., 1986).

That's a lot, so let's break it down a bit. Our goal here is to learn a good model $f_{\boldsymbol{\Theta}}$ of natural language such that

$$
p(w_t \mid w_{1:t-1}) \approx f_{\boldsymbol{\Theta}}(w_{t-1}, \dots, w_{t-N}). \tag{5}
$$

So the left-hand side is the true conditional distribution, capturing next-word prediction. It's the goal of language modeling. But in practice, modeling the full context is hard. So we settle for the right-hand side, which is a parametric approximation $f_{\boldsymbol{\Theta}}$ of this true distribution with a context window of size $N$.

In Bengio, they model $f_{\boldsymbol{\Theta}}$ using two components. First, they represent words as vectors. Let $\mathcal{V}$ denote our vocabulary, which is simply a set of integers $\mathcal{V} = \{1, 2, \dots, V\}$ indexing all $V$ words in a language. We will represent each word as a $D$-vector, and so we can represent the entire language as a matrix $\mathbf{C} \in \mathbb{R}^{V \times D}$ (Figure 1). Now for the $t$-th word in a sequence $w_{1:T}$, we have an associated index in the vocabulary, which we will denote as $I(w_t) \in \mathcal{V}$. This notation might be a bit odd, but I'm careful here because $w_t$ is not a well-defined mathematical object, and it cannot index $\mathbf{C}$. But $I(w_t)$ is an integer and can index $\mathbf{C}$, and so $\mathbf{c}_{I(w_t)}$ is a $D$-dimensional vector (a row vector of $\mathbf{C}$) representing the $I(w_t)$-th word in the vocabulary, associated with the $t$-th word in the sequence. This vector is what we are calling an "embedding" or "distributed representation".

Second, Bengio et al represent the probability function over words (Equation $1$) as a feed-forward neural network $g$ with parameters $\boldsymbol{\Omega}$ and arguments $\mathbf{C}$:

$$
f_{\boldsymbol{\Theta}}(w_{t-1}, \dots, w_{t-N}) = g_{\boldsymbol{\Omega}}\left(\mathbf{c}_{I(w_{t-1})}, \dots, \mathbf{c}_{I(w_{t-N})}\right). \tag{6}
$$

They then use back-propagation to jointly estimate the parameters

$$
\boldsymbol{\Theta} := \{\mathbf{C}, \boldsymbol{\Omega}\}. \tag{7}
$$

In other words, they learn the neural network parameters $\boldsymbol{\Omega}$ at the same time as learning the word embeddings $\mathbf{C}$. Note that "distributed representation" can refer to either the continuously-valued vector, e.g. word embedding, or the concept distributed across neurons.
This duality is exemplified in $\mathbf{C}$, which is both a set of learnable parameters and the embeddings themselves!

Why might this work? The authors explain the idea so well that it's worth just quoting the original paper:

In the proposed model, it will so generalize because "similar" words are expected to have a similar feature vector, and because the probability function is a smooth function of these feature values, a small change in the features will induce a small change in the probability. Therefore, the presence of only one of the above sentences in the training data will increase the probability, not only of that sentence, but also of its combinatorial number of "neighbors" in sentence space.

This is a beautiful idea. If we have word embeddings that are "well-organized" in the sense that words that play similar roles in sentences (semantically and syntactically) have similar embeddings, and if we have a smooth function from word embeddings to probabilities, then small changes in words lead to small changes in embeddings, which lead to small changes in probabilities (Figure 2). Pause for a moment to really think about this. Words are discrete objects, and a "small change in a word", while intuitive to humans, is ill-defined. But this approach concretizes what that means. To quote the paper Linguistic regularities in continuous space word representations (Mikolov et al., 2013), which we'll discuss later:

Whereas an $N$-gram model works in terms of discrete units that have no inherent relationship to one another, a continuous space model works in terms of word vectors where similar words are likely to have similar vectors. Thus, when the model parameters are adjusted in response to a particular word or word-sequence, the improvements will carry over to occurrences of similar words and sequences.

For example, if the words "dog" and "cat" are nearby in word-embedding space, then maybe "The cat is walking on the sidewalk" and "The dog is walking on the sidewalk" should have similar probabilities. And only one of these two sentences would need to exist in the training data for the model to generalize well to both sentences!

As I mentioned, this idea was not entirely new in 2003. Since the 1980s, researchers had known that neural networks can generalize because they distribute their representation across many neurons (Hinton, 1986). Each new example modifies the weights, incorporating new knowledge into the old. However, (Bengio et al., 2003) is a landmark paper in NLP because it was the first application of this idea to language modeling. The Bengio paper took seriously the idea that we could build a statistical model of language using the distributed representations of words. It was the first hint that we could use neural networks to overcome the curse of dimensionality that plagued statistical NLP.

Autoregressive framework

This is a promising idea, but we glossed over an important detail: how do we actually train this model? What is the loss function or objective that the neural network should use? And given a fit model, how do we generate a new sequence? These are important questions to answer per se, but they are also important because, at a conceptual level, there is really no difference between Bengio's model and the frontier large language models today. So understanding this is critical to understanding LLMs. Both are autoregressive models trained using next-word prediction.
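Before walking through an example, here is a minimal numpy sketch of what Equations 5 through 7 describe: an embedding lookup followed by a small feed-forward network that outputs a distribution over the next word. The sizes and variable names are mine, purely for illustration, not Bengio et al's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, N, H = 10_000, 64, 2, 128   # vocab size, embedding dim, context size, hidden dim

# Learnable parameters Theta = {C, Omega}: the embedding matrix C is itself a parameter.
C = rng.normal(scale=0.1, size=(V, D))        # word embeddings, one row per vocab index
W_h = rng.normal(scale=0.1, size=(H, N * D))  # input -> hidden
b_h = np.zeros(H)
W_o = rng.normal(scale=0.1, size=(V, H))      # hidden -> vocab logits
b_o = np.zeros(V)

def next_word_distribution(context_ids):
    """g_Omega: map the N previous word indices to an approximation of p(w_t | context)."""
    x = C[context_ids].reshape(-1)            # lookup + concatenate: an (N*D,)-vector
    z = np.tanh(W_h @ x + b_h)                # nonlinear hidden layer
    logits = W_o @ z + b_o                    # un-normalized scores over the vocabulary
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    return p / p.sum()                        # softmax over all V words

p = next_word_distribution(np.array([42, 7]))  # indices I(w_{t-2}), I(w_{t-1})
print(p.shape, p.sum())                        # (10000,) 1.0
```

Training would adjust both the network weights and the rows of `C` by back-propagating the next-word prediction loss, which is the joint learning of Equation 7.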
As an example, imagine we have the following input sentence, which is a quote from Virginia Woolf's A Room of One's Own:

$$
\text{``Intellectual freedom depends upon material things.''} \tag{8}
$$

Now imagine that our model's context window has size $N=2$, and let $\mathbf{c}_p$ denote a padding $D$-vector of all zeros. In Bengio's model, we would start by representing just the first word, "intellectual", as a word embedding. So the first non-zero input to our model would be:

$$
\mathbf{x}_2 = \left[ \begin{array}{l} \mathbf{c}_p \\ \mathbf{c}_{I(w_1)} \end{array} \right] = \left[ \begin{array}{l} \mathbf{c}_p \\ \mathbf{c}_{I(\text{``intellectual''})} \end{array} \right]. \tag{9}
$$

The output of the neural network would be a $V$-dimensional vector representing the probability distribution over $p(w_2 \mid w_1)$. Illustratively:

$$
\mathbf{y}_2 = \left[ \begin{array}{l} p(w_2 = \text{``about''}) \\ p(w_2 = \text{``above''}) \\ \qquad\;\;\vdots \\ p(w_2 = \text{``freedom''}) \\ \qquad\;\;\vdots \end{array} \right]. \tag{10}
$$

We would then compute the cross-entropy loss between this output vector and the true distribution, which is really just a one-hot vector with $1$ for the word "freedom" and $0$ everywhere else. We would then repeat this process on the next word. So the next input sequence would be

$$
\mathbf{x}_3 = \left[ \begin{array}{l} \mathbf{c}_{I(\text{``intellectual''})} \\ \mathbf{c}_{I(\text{``freedom''})} \end{array} \right], \tag{11}
$$

and the output would represent the probability distribution $p(w_3 \mid w_{1:2})$. And again, we would minimize the cross-entropy loss between its associated output vector and a one-hot vector encoding the word "depends". We would repeat this process until the end of the sentence.

Of course, longer sequences are more expensive to train in this way, and this is precisely the point of the context window in Bengio's paper. We only consider the $N$ previous words when predicting the next word. This idea of a limited context window is critical, as it is a constraint that persists into the present day. In this example, since $N=2$, the third input would be

$$
\mathbf{x}_4 = \left[ \begin{array}{l} \mathbf{c}_{I(\text{``freedom''})} \\ \mathbf{c}_{I(\text{``depends''})} \end{array} \right]. \tag{12}
$$

So the model completely loses the word "intellectual". It is now outside the context.

Since minimizing the cross-entropy loss is equivalent to maximizing the log likelihood—see here for an example if this idea is new to you—we can generalize the logic above by saying that we want to maximize the log likelihood of our training data, again using a neural network as a parametric function approximation of the true distribution:

$$
\boldsymbol{\Theta}^{\star} = \arg\!\max_{\boldsymbol{\Theta}} \left\{ \sum_{t=1}^T \log g_{\boldsymbol{\Omega}} \left(\mathbf{c}_{I(w_{t-N})}, \dots, \mathbf{c}_{I(w_{t-1})} \right) \right\}. \tag{13}
$$

Of course, we can estimate $\boldsymbol{\Theta}^{\star}$ by minimizing the negative log likelihood using gradient descent via back-propagation.

That's it. At the conceptual level, this framework is no different from how frontier large language models are trained today. As we will see later though, there is a lot of additional machinery that is needed to make these models work in practice.

Finally, imagine we fit our model, meaning we find good parameters $\boldsymbol{\Theta}^{\star}$ that maximize our log likelihood. How can we use these parameters to generate a random sequence or sentence? We could draw the first word at random from the vocabulary. And then we could draw the next word conditional on the first word from our parametric approximation of $p(w_2 \mid w_1)$. And then we could draw the third word conditional on the second and first words from our parametric approximation of $p(w_3 \mid w_{1:2})$. And so on. This is why LLMs can both understand natural language and generate new sentences. They are not just descriptive models; they are generative models.

There are some subtleties I am glossing over, such as special embeddings to denote the start and end of a sequence, preprocessing steps like lowercasing words, tokenization, and handling out-of-vocabulary words. But I don't think these details matter much here. As an aside, we can call any model trained in this way autoregressive. In statistics, an autoregressive model is any model where a variable is predicted using its own previous values. A classic example is the AR(1) model.

The breakthrough

While (Bengio et al., 2003) was a landmark paper, its full impact was delayed by roughly a decade. This is because training neural networks was hard at the time. It's worth checking out that paper and seeing just how primitive the engineering feels today. For example, they trained on CPUs and without modern tooling like automatic differentiation libraries.

In the intervening decade, there was some early work that built on Bengio's model. For example, in A unified architecture for natural language processing: Deep neural networks with multitask learning (Collobert & Weston, 2008), the authors demonstrate that Bengio's neural language model could be trained and used on a variety of downstream tasks. And in Word representations: A simple and general method for semi-supervised learning (Turian et al., 2010), the authors demonstrate that word embeddings improve state-of-the-art NLP systems when included as additional features. But none of these contributions were convincing demonstrations of Bengio's main idea. So seven years after Bengio et al, it was $N$-grams, not neural networks, which were still the state-of-the-art, at least in practice and outside specialized benchmarks. Honestly, I found this surprising, but I kept reading this claim in various papers.
For example, in the introduction to Recurrent neural network based language model (Mikolov et al., 2010), the authors wrote:

It is questionable if there has been any significant progress in language modeling over simple $N$-gram models... In fact, most of the proposed advanced language modeling techniques provide only tiny improvements over simple baselines, and are rarely used in practice.

Or two years after that, in A fast and simple algorithm for training neural probabilistic language models (Mnih & Teh, 2012), the authors wrote:

In spite of their superior performance, neural probabilistic language models remain far less widely used than $N$-gram models due to their notoriously long training times, which are measured in weeks even for moderately-sized datasets.

Of course, advanced techniques existed and were well known, but they were often impractical. So roughly a hundred years after Andrey Markov's pioneering work, researchers were still struggling to represent human language in a form amenable to mathematics and computation, and $N$-grams were still considered a reasonable choice in NLP.

Today, neural networks are definitively state-of-the-art. What changed? The answer is that we learned to train variants of Bengio's model at scale. Around 2012, researchers were finally able to train neural networks on large datasets. My understanding is that it was the so-called "AlexNet" paper, ImageNet classification with deep convolutional neural networks (Krizhevsky et al., 2012), that convinced many in the research community to pay attention. Convolutional neural networks were already well known and had been trained on small datasets since the 1980s (LeCun et al., 1989). But AlexNet was the first time a deep convolutional neural network was trained end-to-end on a very large (at the time) dataset, ImageNet (Deng et al., 2009), and using GPUs. The results were a tour de force. To quote the paper:

We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-$5$ test error rate of $15.3\%$, compared to $26.2\%$ achieved by the second-best entry.

In other words, AlexNet demolished the state-of-the-art in computer vision. It achieved a roughly $40\%$ reduction in relative error rate. Nothing else came close. As a comparison, the current fastest time for a men's marathon is 2 hours and 35 seconds. The previous record was 2 hours, 1 minute, and 9 seconds, so 34 seconds slower. Now imagine if someone came along and beat the record by half an hour. It would revolutionize the running world.

At the time, computer vision was still dominated by handcrafted feature pipelines, and so the AlexNet results were extremely surprising. For example, in Introduction to the bag of features paradigm for image classification and retrieval (O'Hara & Draper, 2011), the authors wrote:

The past decade has seen the growing popularity of Bag of Features (BoF) approaches to many computer vision tasks, including image classification, video search, robot localization, and texture recognition... BoF-based systems have set new performance standards on popular image classification benchmarks and have achieved scalability breakthroughs in image retrieval.

This introduction to bag-of-features models was put on arXiv in January 2011, whereas AlexNet was published at NeurIPS in December 2012, meaning that the claim above was contemporaneous with the training of AlexNet! My point here is to underscore just how surprising the rise of neural networks was.
To be clear, I am sure many in the research community believed neural networks would work—Hinton has been a believer since probably the 1970s—but this was hardly the consensus view that it is today. So the year 2012 was a changepoint. In 2003, Bengio et al set the stage conceptually. In 2012, Krizhevsky et al set the stage technologically.

Shallow log-linear models

With hindsight, the obvious implication of AlexNet was that NLP researchers circa 2012 should try to train neural networks at scale. Of course, many researchers tried, but let's ground ourselves in one particular model. This will help focus the narrative. To my knowledge, two of the earliest and most successful papers to try this idea were Efficient estimation of word representations in vector space (Mikolov et al., 2013) and Distributed representations of words and phrases and their compositionality (Mikolov et al., 2013). These papers are tightly related by both authorship and time, and together, they helped unlock the core ideas in Bengio's paper, as well as introduce the famous word2vec model. So I think it's fair to treat them as both a unit and as a landmark in our story.

To understand these two papers, we need to understand the computational problems Bengio faced, which means we need to understand the model in more technical detail. Let $\mathbf{x}_t$ be the input to the model, and $\mathbf{y}_t$ be the output. Bengio's model did not support variable-length inputs, and thus the input could only be a fixed number $N$ of words, each represented as a $D$-dimensional embedding. Let's represent this input as the concatenation of $N$ different $D$-vectors from the matrix $\mathbf{C}$ mentioned above, so:

$$
\mathbf{x}_t := \left[ \begin{array}{l} \mathbf{c}_{I(w_{t-1})} \\ \quad\quad\vdots \\ \mathbf{c}_{I(w_{t-N+1})} \end{array} \right]. \tag{14}
$$

One way we can imagine constructing $\mathbf{x}_t$ is if we represent every word in our context window as a $V$-dimensional one-hot vector. Call this a matrix $\mathbf{Q}_t \in \mathbb{R}^{N \times V}$. Then $\mathbf{x}_t = \mathbf{Q}_t \mathbf{C}$ gives us the associated embeddings. In practice, though, we would never do a dense matrix multiplication with complexity $\mathcal{O}(VND)$. Instead, we would simply index into $\mathbf{C}$. So this operation has computational complexity $\mathcal{O}(ND)$. I only belabor this point because I found it confusing when first reading Bengio's paper. (This point is made more clearly in (Collobert & Weston, 2008).)

After construction, this input $\mathbf{x}_t$ is then fed into an extremely simple (relative to today's models) architecture, a feed-forward neural network with a linear projection layer and a nonlinear hidden layer:

$$
\begin{aligned} g_{\boldsymbol{\Omega}}(\mathbf{x}_t) = \mathbf{y}_t &:= \mathbf{b} + \mathbf{Wx}_t + \mathbf{U} \tanh(\mathbf{z}_t), \\ \mathbf{z}_t &:= \mathbf{d} + \mathbf{Hx}_t. \end{aligned} \tag{15}
$$

The output $\mathbf{y}_t \in \mathbb{R}^{V}$ represents the un-normalized probability of each word in the vocabulary. If normalized, this vector would represent the probability distribution we discussed in the autoregressive framework.
Here, we see that $\mathbf{W} \in \mathbb{R}^{V \times ND}$ is a linear projection of the input embeddings $\mathbf{x}_t$, that $\mathbf{H} \in \mathbb{R}^{H \times ND}$ is a linear projection into a hidden state vector $\mathbf{z}_t \in \mathbb{R}^H$, and that $\mathbf{U} \in \mathbb{R}^{V \times H}$ is a linear projection of the nonlinear hidden state vector. So clearly the parameters mentioned in Equation $7$ can be concretized as

$$
\{\mathbf{C}, \boldsymbol{\Omega}\} := \{\mathbf{C}, \mathbf{b}, \mathbf{W}, \mathbf{U}, \mathbf{d}, \mathbf{H}\}. \tag{16}
$$

So why was this expensive to train? We can see that the computational complexity to compute $\mathbf{y}_t$ is proportional to:

$$
\underbrace{ND}_{\mathbf{QC}} \;+\; \underbrace{VND}_{\mathbf{Wx}_t} \;+\; \underbrace{VH}_{\mathbf{U}\tanh(\mathbf{z}_t)} \;+\; \underbrace{HND}_{\mathbf{Hx}_t}. \tag{17}
$$

Note that this complexity is for every single word in the corpus, and we must also account for the number of training epochs. In (Mikolov et al., 2013), the authors write that a "common choice" is $N=10$ and that $ND$ is typically around $500$ to $2000$. However, the hidden layer has dimension $H$ (commonly around $2000$ or so), and this is multiplied by the size of the vocabulary! What does this mean? The dominating term in Equation $17$ is $VH$.

Furthermore, this complexity is just for computing the un-normalized probabilities $\mathbf{y}_t$. To normalize these, we must compute the softmax function over the size of the vocabulary $V$:

$$
p(w_t \mid w_{t-N:t-1}) = \frac{\exp\left(y_{t,I(w_t)}\right)}{\sum_{i=1}^V \exp\left(y_{t,i}\right)}, \tag{18}
$$

where $y_{t,i}$ denotes the $i$-th component of $\mathbf{y}_t$. As I understand it, these were the computational problems Bengio faced.

The two Mikolov papers did not present a single trick to solve them. Rather, the papers made a number of modeling choices, mostly already established in the literature, that in combination finally made learning distributed representations of words scalable.

First, in the first paper, they avoided computing the full softmax function using hierarchical softmax, introduced by Morin and Bengio in Hierarchical probabilistic neural network language model (Morin & Bengio, 2005). I don't think the details of this matter much here. See this blog post for a nice explanation with code. Suffice to say that it's an efficient way to compute the normalized probabilities in Equation $18$. The computational complexity is reduced from $\mathcal{O}(V)$ to $\mathcal{O}(\log_2 V)$. In the second paper, they further sped up the softmax computation by introducing a technique called negative sampling. The theory here is rich and deserving of its own post, but the main idea is to draw $K$ samples from a noise distribution and train the model to disambiguate observations from noise. The important point here is that one can prove this converges to the correct probabilities without explicitly computing the normalizing constant. See (Gutmann & Hyvärinen, 2010) for details.
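To get a feel for why the $VH$ term (and the softmax over $V$) hurts, here is a quick back-of-the-envelope calculation consistent with the ballpark values quoted above; the exact numbers are my own illustrative choices.

```python
# Rough per-word cost of Bengio's model (Equation 17) with ballpark values.
N, D, H, V = 10, 100, 2_000, 1_000_000   # context size, embed dim, hidden dim, vocab size

lookup = N * D        # Q C          (embedding lookup)
direct = V * N * D    # W x_t        (direct input -> vocab connection)
output = V * H        # U tanh(z_t)  (the dominating vocabulary-sized term)
hidden = H * N * D    # H x_t

total = lookup + direct + output + hidden
print(f"{lookup=:,} {direct=:,} {output=:,} {hidden=:,}")
print(f"total ~ {total:,} operations per word, dominated by V*H = {output:,}")
```

Hierarchical softmax and negative sampling attack exactly this kind of vocabulary-sized cost.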
We don't need to fully grok these techniques; just know that these two approaches are both ways of getting around the expensive normalization in Equation $18$. For example, if $V = 1 \times 10^6$, then $\log_2(V) \approx 20$. And in the second paper, they chose $K$ to be $2$ to $20$ depending on the dataset.

Second, they stripped out the non-linear part of Bengio's model (so removing $\mathbf{U} \tanh(\mathbf{z}_t)$), reducing the model to a simple linear operation: a dot product. The result is a model that is log-linear in the features, which I'll explain in a moment.

Now the models. In the first paper, they presented two models, a continuous bag-of-words model (CBOW) and a continuous skip-gram model (skip-gram). These are the foundations of the word2vec NLP toolkit. In the CBOW model, a set of neighboring words are averaged to predict a target word; and in the skip-gram model, a target word is used to predict its neighboring words (Figure 3). Both worked empirically in practice, but the authors only built on the skip-gram model in the second paper. And since I don't think it's that important here to understand both, I'll just focus on the skip-gram model.

Let's build a little intuition by going into detail. The objective of the skip-gram model is to minimize the cross-entropy loss between a single target word and its neighboring words. So the input to the model is only a single $D$-vector representing a single word (so no context window). The outputs, however, are the $N$ words surrounding the input. Let $N = 2C$. Then the objective function is:

$$
\frac{1}{T} \sum_{t=1}^T \sum_{-C \leq j \leq C,\; j \neq 0} \log p(w_{t+j} \mid w_t). \tag{19}
$$

I will continue to use the notation $N$ for this context window, but clearly it is different in precise meaning from the $N$ in an $N$-gram or the $N$ in Bengio's paper. We model the conditional probability in Equation $19$ via a simple log-linear function:

$$
p(w_{t+j} \mid w_t) = p(\mathbf{u}_{I(w_{t+j})} \mid \mathbf{c}_{I(w_{t})}) = \frac{\exp\left( \langle \mathbf{u}_{I(w_{t+j})}, \mathbf{c}_{I(w_{t})} \rangle \right)}{\sum_{i \in \mathcal{V}} \exp\left( \langle \mathbf{u}_i, \mathbf{c}_{I(w_{t})} \rangle \right)}. \tag{20}
$$

Here, the $\mathbf{c}_i$ are word embeddings of the inputs. These are analogous to the row vectors of $\mathbf{C}$ in Bengio's model and again are constructed via a lookup. The output embeddings $\mathbf{u}$ are a little trickier to interpret. If we were using the full softmax function, we would have $V$ such output embeddings, and these would represent the weights of the softmax function. But when using hierarchical softmax or negative sampling, the interpretation changes a bit. Again, I don't think the details really matter here. The key point is that we take a sequence $w_{1:T}$, select the appropriate embeddings $\mathbf{c}_{1:T}$, and compute Equation $20$ directly, learning both the parameters $\mathbf{C}$ and $\mathbf{U}$.
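To make the "dot product plus a few noise words" idea concrete, here is a minimal numpy sketch of a single skip-gram training step with negative sampling. This is my own simplified rendering of the idea, not word2vec's actual implementation (which adds subsampling, a unigram noise distribution, and more).

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, K, lr = 5_000, 100, 5, 0.025       # vocab size, embed dim, negative samples, learning rate
C = rng.normal(scale=0.01, size=(V, D))  # input ("center") embeddings
U = np.zeros((V, D))                     # output ("context") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_ns_step(center, context):
    """One SGD step: push <u_context, c_center> up, and K noise words down."""
    # A real implementation would avoid sampling the true context word as noise.
    negatives = rng.integers(0, V, size=K)
    targets = np.concatenate(([context], negatives))
    labels = np.concatenate(([1.0], np.zeros(K)))   # 1 = real pair, 0 = noise pair

    c = C[center]                                   # (D,)
    scores = sigmoid(U[targets] @ c)                # (K+1,) predicted "real pair" probs
    errors = scores - labels                        # gradient of the logistic loss

    C[center] -= lr * (errors @ U[targets])         # update the center embedding
    U[targets] -= lr * np.outer(errors, c)          # update the (K+1) output embeddings

# e.g. the pair ("cat", "sidewalk") encoded as vocabulary indices:
skipgram_ns_step(center=42, context=1337)
```

Each step touches only one row of `C` and $K+1$ rows of `U`, which is why the normalizing constant over the full vocabulary never needs to be computed.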
This is called a "log-linear model" because the log of the conditional probability is linear with respect to its arguments:

$$
\log p(w_{t+j} \mid w_t) = \langle \mathbf{u}_{I(w_{t+j})}, \mathbf{c}_{I(w_{t})} \rangle - Z. \tag{21}
$$

Here, I just write $Z$ to denote the normalizing constant, the denominator in Equation $20$, because it is not particularly interesting, and we do not even need to compute it when using negative sampling. The key relationship that the model is learning is a simple linear weighting of the input embeddings that allows it to predict nearby words.

Hopefully, it is clear why this model is so fast to train. We have no hidden layers or nonlinearities. We simply compute a dot product and ignore the normalizing constant. For example, when using the full softmax, the computational complexity is:

$$
N(D + DV). \tag{22}
$$

Here, the $D + DV$ term comes from the embedding lookup and the dot products against all $V$ output embeddings, and we must do this for each of the $N$ words in our context window. However, in practice, we can eliminate $V$ entirely, replacing it with something around $\log_2(V)$ or $K$. This is significantly smaller than Equation $17$. For example, if we assume that $H = D = 500$, $N = 10$, and $V = 1 \times 10^6$, then hierarchical softmax is five orders of magnitude smaller in terms of complexity.

So in these two seminal Mikolov papers, the authors stripped down Bengio's core idea to a simple log-linear model, and thus were able to train that model at scale. That said, I want to stress a subtlety that took me time to grok. Neither the CBOW nor the continuous skip-gram models presented here are full language models. Notice that their objective functions (nearby-word prediction) are not in the autoregressive framework and thus cannot easily plug into Equation $1$. That's because the goal of these papers was not to learn a full language model but rather to learn good word embeddings. They say this explicitly in the first paper (emphasis mine):

Representation of words as continuous vectors has a long history. A very popular model architecture for estimating neural network language model (NNLM) was proposed in (Bengio et al., 2003), where a feed-forward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others. Another interesting architecture of NNLM was presented in (Mikolov, 2007; Mikolov et al., 2009), where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model.

So the word2vec models were simple and shallow (single layer) neural networks designed for fast training and to learn good embeddings. They were not full language models. This is a major distinction from similar prior art, such as A scalable hierarchical distributed language model (Mnih & Hinton, 2008). In this paper, the authors demonstrate more scalable inference of Bengio's model by representing the vocabulary compactly through binary trees and by using a log-bilinear model.
But they go end-to-end to a language model, as the paper title suggests. Mikolov et al's two models were relentlessly simple and efficient. As I understand it, both CBOW and skip-gram worked well in practice. It did not matter if neighboring words predict a target word or if that target word predicts its neighboring words. The real differentiator was that both models could be efficiently trained at scale. And with scale, something remarkable happened: the authors discovered that distributed representations of words, trained in this fashion, captured semantic and syntactic information.

Emergent linguistic regularities

Today, linguistic regularities in word embeddings are so well-established that it might seem boring to read about here. But understood in context, these regularities should be surprising! How can a simple linear model, trained on essentially next- or nearby-word prediction via maximum likelihood estimation, learn distributed representations of words with remarkable syntactic and semantic properties and relationships? In my mind, this was the first big result that suggested neural networks would not just work but really work in language modeling.

The word2vec papers were not the first to observe these properties. My understanding is that credit goes to yet another Mikolov paper from 2013, Linguistic regularities in continuous space word representations (Mikolov et al., 2013). Here, the authors showed that many semantic and syntactic relationships correspond to approximately constant vector offsets in the embedding's vector space.

To be clear, researchers had long observed that one could uncover structure in vector representations of words. For example, in the 1989 paper Self-organizing semantic maps (Ritter & Kohonen, 1989), the authors trained self-organizing maps (Kohonen, 1982) on pre-computed two-dimensional vectors representing words and demonstrated that these maps contain semantic structure. However, these models were not trained end-to-end (the representations themselves were not learned) and did not have linear structure. It would be a stretch to call these vectors "word embeddings". But log-linear models like word2vec were remarkable precisely because they enabled analogical reasoning through simple vector offsets, i.e. linear operations (Figure 4)!

Perhaps the most famous example of analogical reasoning with word embeddings is the relationship "king is to queen as man is to woman":

$$
\text{vec}\left(\text{``king''}\right) - \text{vec}\left(\text{``man''}\right) + \text{vec}\left(\text{``woman''}\right) \approx \text{vec}\left(\text{``queen''}\right). \tag{23}
$$

Or in (Mikolov et al., 2013), the authors give the example that "Russia" plus "river" is the Volga:

$$
\text{vec}\left(\text{``Russia''}\right) + \text{vec}\left(\text{``river''}\right) \approx \text{vec}\left(\text{``Volga River''}\right). \tag{24}
$$

In my mind, these are pretty fascinating and non-obvious results. They suggest that the training is not mixing vector dimensions in undesirable ways and that the learned representations stay approximately linear. Again, viewed with fresh eyes, it is really quite remarkable! If you were a researcher in 2003 reading Bengio's paper, would you have predicted this result with high confidence?
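For the curious, here is a sketch of how analogy queries like Equation 23 are typically answered in practice: form the offset vector, then return the nearest word by cosine similarity, excluding the query words themselves. The tiny hand-crafted embedding matrix below is a made-up stand-in for real trained vectors, purely for illustration.

```python
import numpy as np

# A made-up 4-word "vocabulary" with hand-crafted 2-D embeddings.
# Dimension 0 roughly encodes "royalty", dimension 1 roughly encodes "gender".
vocab = ["king", "queen", "man", "woman"]
E = np.array([
    [0.9,  0.8],   # king
    [0.9, -0.8],   # queen
    [0.1,  0.8],   # man
    [0.1, -0.8],   # woman
])
index = {w: i for i, w in enumerate(vocab)}

def analogy(a, b, c):
    """Return the word closest (by cosine similarity) to vec(a) - vec(b) + vec(c)."""
    query = E[index[a]] - E[index[b]] + E[index[c]]
    sims = (E @ query) / (np.linalg.norm(E, axis=1) * np.linalg.norm(query))
    for i in np.argsort(-sims):            # best match first...
        if vocab[i] not in {a, b, c}:      # ...skipping the query words themselves
            return vocab[i]

print(analogy("king", "man", "woman"))     # -> "queen"
```

With real word2vec vectors the offsets are only approximate, but the same nearest-neighbor query recovers the analogies reported in the papers surprisingly often.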
While these two Mikolov papers are landmark papers on learning word embeddings at scale, they are by no means the only ones. Many other researchers worked in this area. Perhaps the most famous paper on word embeddings that we do not have time to discuss is GloVe: Global vectors for word representation (Pennington et al., 2014). In this paper, the authors present a unifying view between two common methods for learning word embeddings, global matrix factorization methods and local context window methods. But there were many others as well, such as Skip-thought vectors (Kiros et al., 2015), Word embeddings through Hellinger PCA (Lebret & Collobert, 2013), and Eigenwords: spectral word embeddings (Dhillon et al., 2015), to cite just a few illustrative examples.

For ease of presentation, I have focused on word-level embeddings. But the idea was naturally and quickly extended to larger contexts. This was motivated by the fact that a word's meaning is obviously context-dependent (polysemy). For example, the word "bank" might refer to a financial institution or the side of a river. A word embedding for "bank" that is not context-dependent must somehow flatten this distinction. So lack of context is obviously a limitation. Researchers tackled this through a variety of approaches. One approach was to use the hidden states of a bidirectional long short-term memory network (LSTM) as context-specific embeddings, as in context2vec: Learning generic context embedding with bidirectional LSTM (Melamud et al., 2016) or Learned in translation: contextualized word vectors (McCann et al., 2017). But perhaps the most noteworthy example of this idea—and one I mention here because it will come up later—was Deep contextualized word representations (Peters et al., 2018), or ELMo. Here, the authors both used a bidirectional LSTM to extract more context-dependent word embeddings and then trained on an objective function that was dependent on the downstream task. This hints at combining pre-trained embeddings with supervised fine-tuning, which we'll see later.

Long-range dependencies

By 2013, word- and phrase-level embeddings demonstrably worked. The key to unlocking them was simple methods that scaled on modern hardware. However, the problem with these embeddings is that they were still computed with respect to a fixed window. It was not immediately obvious how this idea could be extended to longer phrases or sentences or to larger texts. Of course, researchers had tried. For example, (Collobert & Weston, 2008) used the idea of time-delay neural networks (Waibel et al., 1989) to model sentences of variable lengths, but the authors used convolutions that still had a fixed-width window size. The embedding itself, then, was not constructed while accounting for long-range dependencies.

So word embeddings, while a beautiful idea, only set the stage for the next big idea in our history: tackling the problem of modeling long-range dependencies without an explicit context window. The key innovation here was sequence-to-sequence models. In a sequence-to-sequence model, a neural network encodes a variable-length input sequence into a fixed-length vector, while a second neural network decodes this fixed-length vector back into a variable-length output sequence. In both Bengio and Mikolov's papers, the input was a word embedding ($\mathbf{c}$ in Equations $14$ and $20$). In a sequence-to-sequence model, the analogous object is this intermediate fixed-length vector, which now acts as an embedding of the entire input sequence.
The precise architectures used for the encoder and decoder can vary, but clearly they should be architectures that support variable-length sequences, such as recurrent neural networks (RNNs) or LSTMs. To me, the most intuitive example of a sequence-to-sequence model is a translation model. The input sequence is a sentence in a source language like English, and the output sequence is a sentence in a target language like Chinese (Figure 5). And since some of the most important early work in sequence-to-sequence modeling was in neural machine translation (NMT), I'll often use translation as a default example. However, the more general case is any mapping from one sequence to another.

This idea is fairly straightforward; it is analogous to an auto-encoder but for variable-length sequences, and auto-encoders (Bourlard & Kamp, 1988) are nearly as old as back-propagation. However, as we have already seen, even seemingly simple ideas are hard-won. The original work in RNNs and LSTMs goes back to at least the early 1990s, with seminal papers like Finding structure in time (Elman, 1990), Serial order: A parallel distributed processing approach (Jordan, 1997), and Long short-term memory (Hochreiter & Schmidhuber, 1997). By the 2010s, these sequential models were well-known and already used in NLP. See (Mikolov et al., 2010; Sutskever et al., 2011; Graves, 2013) for example. These models were an important bridge, proving that we could train RNNs at scale and overcome the vanishing gradient problem discussed in Learning long-term dependencies with gradient descent is difficult (Bengio et al., 1994). But they were not yet sequence-to-sequence models.

To my knowledge, the first paper to propose a full encoder–decoder architecture for NLP was Recurrent continuous translation models (Kalchbrenner & Blunsom, 2013). Here, the authors proposed training two neural networks end-to-end. The decoder was an RNN, inspired by the model in (Mikolov et al., 2010). But somewhat surprisingly, the encoder was not also an RNN. With hindsight, two RNNs feels like the obvious choice, but instead the authors used a convolutional sentence model (CSM). The details don't really matter here, but this is essentially an NLP model which uses convolutional layers. Why this choice? Well, CSMs were actually developed by the same authors in the same year, in Recurrent convolutional neural networks for discourse compositionality (Kalchbrenner & Blunsom, 2013), and my hypothesis is that this choice just felt obvious to them at the time.

So (Kalchbrenner & Blunsom, 2013) was a landmark paper in the sense that it was the first attempt at a sequence-to-sequence model, but with hindsight we can immediately see how to improve it with a better sequential model for the encoder. And that is precisely what happened in two follow-up papers. First, in Learning phrase representations using RNN encoder–decoder for statistical machine translation (Cho et al., 2014), the authors propose the first encoder–decoder architecture in which both neural networks were RNNs. And then in Sequence to sequence learning with neural networks (Sutskever et al., 2014), the authors proposed a similar model but using LSTMs, since LSTMs often work better at handling the aforementioned vanishing gradient problem.
In this paper, Sutskever makes the connection to Kalchbrenner explicit:

Our work is closely related to Kalchbrenner and Blunsom, who were the first to map the input sentence into a vector and then back to a sentence, although they map sentences to vectors using convolutional neural networks, which lose the ordering of the words.

As a nitpick, convolutional neural networks do model local patterns and order, but they lose global order without very large receptive fields. Still, Sutskever's point is directionally correct. So even at the time, the academic history we are tracing here was clear.

To understand these models in a bit more detail, let's go through the RNN encoder–decoder in (Cho et al., 2014), using Figure 6 as a reference. Let $\mathcal{X}$ be a variable-length input sequence with length $T_x$, and let $\mathcal{Y}$ be a variable-length output sequence with length $T_y$:

$$
\begin{aligned} \mathcal{X} &= \{ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{T_x} \}, \\ \mathcal{Y} &= \{ \mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_{T_y} \}. \end{aligned} \tag{25}
$$

Note that $(\mathcal{X}, \mathcal{Y})$ is a single observation pair, but I am suppressing the sample index for ease of notation. Also, I bold each vector in both sequences because they are embedded words. In an RNN, we iteratively compute hidden state variables over $T_x$ steps, where for the $t$-th step we define a recurrence relation between hidden states as:

$$
\mathbf{h}_t = f_{\textsf{enc}} \left( \mathbf{h}_{t-1}, \mathbf{x}_t \right). \tag{26}
$$

This might be a little abstract. So concretely, a simple RNN might instantiate $f_{\textsf{enc}}$ as the following nonlinear function of the current word embedding and the previous hidden state:

$$
\mathbf{h}_t = \tanh \left(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t \right). \tag{27}
$$

The matrices hopefully have obvious dimensions, and we can initialize the first hidden state vector $\mathbf{h}_0$ however we like, such as a vector of all zeros. This is simply one choice, though. We can imagine many types of choices, such as a vanilla RNN unit or an LSTM unit. The key point is that the hidden state vectors

$$
\mathcal{H} = \{\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_{T_x}\} \tag{28}
$$

carry forward information from previous words in the sequence via these recurrent connections, much like a hidden Markov model (Baum & Petrie, 1966). A powerful consequence of this model is that RNNs do not limit the size of the input context window. Different input sequences $\mathcal{X}$ can be different sizes, unlike in the $N$-gram model or in Bengio's model (Equation $14$). See Andrej Karpathy's excellent blog post, The unreasonable effectiveness of recurrent neural networks, for a more detailed presentation of RNNs.

Finally, we define the context vector $\mathbf{c}$ as some function of the hidden states:

$$
\mathbf{c} = q(\mathcal{H}). \tag{29}
$$

Notice that $\mathbf{c}$ does not have a time index, because it compresses all the temporal information in the input sequence $\mathcal{X}$ into a fixed-width vector.
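Here is a minimal numpy sketch of the encoder side (Equations 26 through 29), using the vanilla tanh recurrence and, as the summary function $q$, simply the last hidden state (one common choice, discussed next). All sizes and names are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 64, 128                                 # embedding dim, hidden-state dim
W_hh = rng.normal(scale=0.1, size=(H, H))      # hidden -> hidden
W_xh = rng.normal(scale=0.1, size=(H, D))      # input  -> hidden

def encode(X):
    """Run h_t = tanh(W_hh h_{t-1} + W_xh x_t) over a variable-length input sequence."""
    h = np.zeros(H)                            # h_0: one choice is all zeros
    hidden_states = []
    for x_t in X:                              # X is a list of word embeddings
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        hidden_states.append(h)
    return hidden_states

def context_vector(hidden_states):
    """q(H): here, just the final hidden state; other summaries are possible."""
    return hidden_states[-1]

X = [rng.normal(size=D) for _ in range(7)]     # a 7-word "sentence" of random embeddings
c = context_vector(encode(X))
print(c.shape)                                 # (128,) regardless of the input length
```

Note that the output shape of the context vector is fixed no matter how long the input is, which is exactly the property that later becomes a bottleneck.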
The easiest definition of $\mathbf{c}$ is simply the last hidden state vector, i.e. $\mathbf{c} = \mathbf{h}_{T_x}$. This context vector becomes an input to the decoder, another RNN with recurrence relation

$$
\mathbf{s}_t = f_{\textsf{dec}} \left( \mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c} \right) \tag{30}
$$

and hidden states

$$
\mathcal{S} = \{\mathbf{s}_1, \mathbf{s}_2, \dots, \mathbf{s}_{T_y}\}. \tag{31}
$$

The decoder then outputs the sequence $\mathcal{Y}$, one word at a time. The typical objective of a sequence-to-sequence model is again the autoregressive objective of next-word prediction: maximize a log likelihood, in which each conditional probability is modeled via the decoder RNN:

$$
\log p(\mathcal{Y}) = \sum_{t=1}^{T_y} \log p(\mathbf{y}_t \mid \mathbf{y}_{1:t-1}) = \sum_{t=1}^{T_y} \log f_{\textsf{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}). \tag{32}
$$

Again, this might be a bit abstract. So for example, one possible instantiation of $f_{\textsf{dec}}$ is as a linear transformation of the input variables:

$$
f_{\textsf{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}) = \mathbf{W}_{zs} \mathbf{s}_{t-1} + \mathbf{W}_{zy} \mathbf{y}_{t-1} + \mathbf{W}_{zc} \mathbf{c}. \tag{33}
$$

Of course, this is just one choice. Then all the model weights are learned end-to-end by optimizing this log likelihood (Equation $32$). In this way, we can convert a variable-length input $\mathcal{X}$ into a variable-length output $\mathcal{Y}$.

This RNN encoder–decoder framework is powerful, since many problems in NLP can be framed in this way. For example, text summarization, machine translation, and dialogue with a conversational agent can all be framed as sequence-to-sequence modeling challenges. To be clear, other researchers around this time had attempted other approaches to handling variable-length sequences, such as the recursive neural tensor network in Recursive deep models for semantic compositionality over a sentiment treebank (Socher et al., 2013). But the RNN encoder–decoder would become the de facto framework of choice for a large range of NLP tasks.

As an aside, sometimes these models are called sequence transduction models or transduction models or even just transducers. My understanding is that "transduction" here just means converting one sequence into another by learning a conditional distribution $p_{\theta}(\mathbf{y}_{1:T} \mid \mathbf{x}_{1:S})$. In this context, "transduction" does not have the sense that Vladimir Vapnik gave it. In Vapnik's definition, transduction loosely means classification of a specific example rather than a general rule for classifying future examples (Gammerman et al., 2013). But this is not the sense which people mean when they refer to models like the transformer as a "transducer".

Adaptive context

In my mind, Kalchbrenner, Cho, and Sutskever's three papers (Kalchbrenner & Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014) were the foundations of sequence-to-sequence modeling, and many other papers have built around and off this core idea.
But the key point for us here is that these three papers make the same logical choice: they lift the idea of a fixed-length embedding for words or phrases into the context vector $\mathbf{c}$ of a sequential model, such that the models can now support variable-length inputs and outputs and long-range dependencies in each.

However, a problem with this approach was that long-range dependencies got "lost" in this context vector. For example, imagine we had a very long English-language text that we wanted to translate into Chinese. Even if our encoder LSTM was good at capturing long-range dependencies in the English sentence, it would be forced to compress that information into a much shorter, fixed-width vector with no temporal structure, which would then be fed into the decoder. This effect was observed by Cho et al in On the properties of neural machine translation: encoder–decoder approaches (Cho et al., 2014). In this paper, the authors write:

Our analysis shows that the performance of the neural machine translation model degrades quickly as the length of a source sentence increases.

And later:

The most obvious explanatory hypothesis is that the fixed-length vector representation does not have enough capacity to encode a long sentence with complicated structure and meaning.

The authors test this hypothesis through a variety of experiments. For example, in one experiment, they report the BLEU score for an RNN encoder–decoder as a function of sequence length, and they show that the model's performance degrades as the sentences become longer. So the RNN encoder–decoder was promising, but the fixed-width context vector was a bottleneck on modeling long-range dependencies.

Then in 2014, a seminal paper was published that addressed this problem, Neural machine translation by jointly learning to align and translate (Bahdanau et al., 2014). The main invention of this paper was to use the well-known attention mechanism to attend to this context vector. However, the authors barely use the word "attention" in the paper. Instead, they seem to conceptualize it more as a search problem:

In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

I say this paper is "seminal" because, at least to my knowledge, it was really the first paper to use a differentiable attention layer in the rapidly-growing field of NMT. To be clear, the attention mechanism was already known and used outside of NLP. For example, see Learning to combine foveal glimpses with a third-order Boltzmann machine (Larochelle & Hinton, 2010), Learning where to attend with deep architectures for image tracking (Denil et al., 2012), or Recurrent models of visual attention (Mnih et al., 2014). These were all papers published between 2010 and 2014 that applied an attention mechanism to a neural network computer vision system. However, to my knowledge, Bahdanau was the first paper to successfully use attention in NLP. To quote Effective approaches to attention-based neural machine translation (Luong et al., 2015):

In the context of NMT, Bahdanau et al... has successfully applied such attentional mechanism to jointly translate and align words.
To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT.

All that said, “jointly align and translate” is pretty vague, so let’s get technical. Bahdanau’s solution to this bottleneck was to allow each hidden state vector in the decoder to pay attention to possibly all the hidden state vectors in the encoder. What do I mean by “pay attention to”? Here, each decoder hidden state variable $\mathbf{s}_i$ depends not only on the previous hidden state and previous word but also on its own context vector, which is a weighted combination of the encoder’s hidden states!

$$\begin{aligned} \mathbf{s}_i &= f_{\textsf{dec}}(\mathbf{s}_{i-1}, \mathbf{y}_{i-1}, \mathbf{c}_i), \\ \mathbf{c}_i &= \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j. \end{aligned} \tag{34}$$

This is the main idea of the paper. Each decoder hidden state $\mathbf{s}_i$ has access to all the hidden states in the encoder via this context vector $\mathbf{c}_i$ (Figure 7). We can finally define the attention mechanism! Here, it is the weighted sum of hidden state vectors, as this allows each $\mathbf{s}_i$ to attend to different parts of the input sequence through its context vector. Each weight $\alpha_{ij}$ is a function of the previous decoder hidden state $\mathbf{s}_{i-1}$ and the encoder hidden state $\mathbf{h}_j$:

$$\begin{aligned} \alpha_{ij} &:= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \\ e_{ij} &:= \mathbf{v}_a^{\top} \mathbf{z}_{ij}, \\ \mathbf{z}_{ij} &:= \tanh\left( \mathbf{W}_a \mathbf{s}_{i-1} + \mathbf{U}_a \mathbf{h}_j \right). \end{aligned} \tag{35}$$

Let’s call $\boldsymbol{\alpha}_i$ the alignment vector, which we compute one step at a time during the decoding process. So $\mathbf{z}_{ij}$ can be viewed as a shared hidden state, capturing nonlinear information about both the input and output sequences. Importantly, there is one such vector for each input–output pair. And for a given decoder hidden state, the model can up- or down-weight the relationship to $\mathbf{h}_j$ via the parameters $\mathbf{v}_a$. The neural network learns all these model parameters end-to-end via back-propagation, maximizing the log likelihood in Equation 32.

So that’s it. As I understand it, (Bahdanau et al., 2014) was really the first paper to use attention in neural machine translation and probably the most successful use of attention in NLP at the time. The method worked surprisingly well. To quote the paper’s conclusion:

Perhaps more importantly, the proposed approach achieved a translation performance comparable to the existing phrase-based statistical machine translation. It is a striking result, considering that the proposed architecture, or the whole family of neural machine translation, has only been proposed as recently as this year.

As an aside, they actually use a bidirectional RNN for the encoder and then concatenate the forward and backward hidden states. But I don’t think that adds much to our story or to intuition, and it would muddy Figure 7.
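To ground Equations 34 and 35, here is a minimal NumPy sketch of one decoding step of Bahdanau-style additive attention: the previous decoder state is compared against every encoder hidden state, the scores are softmaxed into alignment weights, and the context vector is their weighted sum. All sizes and the random weights are made up for illustration; in the real model these weights are learned end-to-end.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def additive_attention_step(s_prev, H, W_a, U_a, v_a):
    """One decoding step of Bahdanau-style additive attention.

    s_prev : (d_dec,)      previous decoder hidden state s_{i-1}
    H      : (T_x, d_enc)  encoder hidden states h_1 ... h_{T_x}
    Returns the context vector c_i and the alignment weights alpha_i.
    """
    # e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for every j at once.
    z = np.tanh(s_prev @ W_a.T + H @ U_a.T)   # (T_x, d_att)
    e = z @ v_a                               # (T_x,) unnormalized scores
    alpha = softmax(e)                        # alignment weights, sum to 1
    c = alpha @ H                             # context vector, (d_enc,)
    return c, alpha

# Toy example with made-up sizes.
rng = np.random.default_rng(0)
T_x, d_enc, d_dec, d_att = 5, 8, 8, 16
H = rng.normal(size=(T_x, d_enc))
s_prev = rng.normal(size=d_dec)
W_a = rng.normal(size=(d_att, d_dec))
U_a = rng.normal(size=(d_att, d_enc))
v_a = rng.normal(size=d_att)
c, alpha = additive_attention_step(s_prev, H, W_a, U_a, v_a)
```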
The key point is that it was the attention mechanism that allowed for the long-range dependencies encoded by the RNN to be captured through an adaptive context vector. Hopefully, we can now see why the paper uses the words “align and translate”. Here, alignment really means allowing the model to uncover which parts of the input sequence matter to each part of the output sequence—and it does this via the attention mechanism. Finally, while writing this blog post, I came across this incredible comment by Edward Grefenstette, published on 3 May 2014: By and large, the case for deep learning in language hasn’t been fully made. It works well for vision and speech, but that doesn’t entail that it would carry to semantics. Some excellent shallow models without non-linearities, like the Mnih and Hinton log-bilinear models, are excellent and can be trained very quickly. It’s a problem with much “deep learning” work in NLP these days that shallow baselines are never considered or compared to. Deep learning is fascinating and will certainly have an impact in NLP, but don’t rush to believe that it’s the best solution for your NLP problems. I love this comment because it is a time-capsule, perfectly capturing how experts in the field felt about neural networks at the time. (Note that Grefenstette has published papers with other researchers in this story, such as Kalchbrenner and Graves.) So even around the time that Bahdanau et al were publishing groundbreaking work on RNN encoder–decoders with attention, deep learning had still not fully proven itself to the community. Types of attention The attentive reader might be wondering: wasn’t the argument around log-linear models that they were simple and therefore scalable? But Bahdanau’s RNN encoder–decoder with attention seems anything but simple. So on some level, yes, Bahdanau’s model was a step backwards in terms of complexity. But on another level, it was a proof-of-concept that the attention mechanism worked. (Also, Moore’s law.) So researchers quickly built on Bahdanau by studying simpler models and simpler types of attention. Perhaps the most important paper to directly build on Bahdanau’s model was (Luong et al., 2015). In this paper, the authors simplified the model used by Bahdanau, proposed several alternative forms of attention, and showed that an ensemble of attention-based methods produced state-of-the-art results on neural machine translation problems. To be clear, Bahdanau had shown that attention worked and that it seemed to address problems in translating longer sentences, but it did not demonstrably beat the state-of-the-art. Luong’s results more directly suggested that attention might be the way forward. So before we get to the transformer, let’s understand the attention mechanism better through the lens of this paper. The first dimension along which we can define attention is local versus global attention. For example, in the attention mechanism in an RNN encoder–decoder, the conceptual lynchpin is that at each time step i ∈ { 1 , … , T y } i \in \{1, \dots, T_y\} i∈{1,…,Ty​} in the decoding phase, we construct a context vector c i \mathbf{c}_i ci​ which summarizes information from the source sentence via the encoder’s hidden states: c i = ∑ j = a b α i j h j . (36) \mathbf{c}_i = \sum_{j=a}^{b} \alpha_{ij} \mathbf{h}_j. \tag{36} ci​=j=a∑b​αij​hj​.(36) But now I don’t precisely define the limits of the sum, a a a and b b b. 
If a = 1 a=1 a=1 and b = T x b=T_x b=Tx​, then the context vector is constructed by considering all the hidden states of the source sentence. This is what Luong calls global attention (Figure 8 8 8, left), since each word in the target sentence has access to information about all the words in the source sentence. But we could also define a a a and b b b such that they form a window around the decoder’s hidden state or model the left-to-right structure of many natural languages. This is what Luong calls local attention (Figure 8 8 8, middle). So these are two ways in which we can construct the context vector c i \mathbf{c}_i ci​. The second dimension along which we can define attention is how we define the alignment weights α i \boldsymbol{\alpha}_i αi​. For example, the simplest choice is simply that α i \boldsymbol{\alpha}_i αi​ is a one-hot vector, such that c i \mathbf{c}_i ci​ selects a single encoder hidden state vector h k \mathbf{h}_k hk​ to use in the i i i-th decoding step. This would be hard- rather than soft-search. But more generally, we can write these alignment weights as the unnormalized output of a score function. Using the notation from Equation 35 35 35 above, we can write this as: e i j : = score ( h j , s i − 1 ) . (37) e_{ij} := \text{score}(\mathbf{h}_j, \mathbf{s}_{i-1}). \tag{37} eij​:=score(hj​,si−1​).(37) And in Luong, the authors explore three main scoring functions. These are dot-product attention, general attention, and additive attention, defined as: e i j = score ( h j , s i − 1 ) = { h j ⊤ s i − 1 dot, h j ⊤ W a s i − 1 general, v a ⊤ tanh ⁡ ( W a h j + U a s i − 1 ) additive (Bahdanau). (38) e_{ij} = \text{score}(\mathbf{h}_j, \mathbf{s}_{i-1}) = \begin{cases} \mathbf{h}_j^{\top} \mathbf{s}_{i-1} & \text{dot,} \\ \mathbf{h}_j^{\top} \mathbf{W}_a \mathbf{s}_{i-1} & \text{general,} \\ \mathbf{v}_a^{\top } \tanh \left( \mathbf{W}_a \mathbf{h}_j + \mathbf{U}_a \mathbf{s}_{i-1} \right) & \text{additive (Bahdanau).} \end{cases} \tag{38} eij​=score(hj​,si−1​)=⎩⎪⎪⎨⎪⎪⎧​hj⊤​si−1​hj⊤​Wa​si−1​va⊤​tanh(Wa​hj​+Ua​si−1​)​dot,general,additive (Bahdanau).​(38) Of course, you can imagine many other score functions. My own view is that it’s too difficult here to reason about which form of attention is better in some theoretical sense. Which form works best is an empirical result. In (Luong et al., 2015), the empirical results were mixed in the sense that all three score functions worked well. In fact, the results weren’t even strong enough for the authors to claim that attention-based methods were demonstrably better. This was their conclusion: Our analysis shows that attention-based NMT models are superior to non-attentional ones in many cases, for example in translating names and handling long sentences. So by late 2015, just two years before the transformer, attention was just becoming popular in NMT but was not yet the de facto modeling choice. That said, obviously this will change, and when it does, there will be a clear winner amongst the choices above, and that winner is dot-product attention. Dot-product attention is the variant used by the transformer, and thankfully, in my mind it is the most intuitive since the dot product is a standard way to measure the similarity between two vectors. So we can interpret the dot-product score function as measuring the similarity between the encoder and decoder hidden states. The third and final dimension along which we can define attention is through the variables of interest. 
In order to understand what I mean, we can no longer refer to attention in terms of the hidden states of RNNs. We need more general terminology. In the literature, attention is often viewed through the lens of information retrieval. In this literature, a query is what you are asking for; a key is what you can search through; and a value is what you can return.

Let me give an example (Figure 9). Imagine I type some text into a search bar: “indian food near me”. This text is the query. Now imagine the search engine runs that query against a bunch of metadata associated with different restaurants. For example, restaurant descriptions, keywords, reviews, ratings, and distances from my location. These metadata are the keys. So the query is “run against” the keys. Finally, the things returned are candidate restaurants. These are the values. In the language of information retrieval, we can describe the attention mechanism as a kind of soft-search, since it can return a linear combination of the values. As you may recall, this is precisely how Bahdanau described their model in the quote above.

So in Bahdanau’s RNN encoder–decoder, the decoder’s hidden states $\mathbf{s}_i$ are the queries, since for each hidden state $\mathbf{s}_i$ we want to search through the source sentence. The encoder’s hidden states $\mathbf{h}_j$ are the keys, since these are the metadata associated with the source sentence that we can search through. Finally, the encoder’s hidden states are also the values, since the context vector $\mathbf{c}_i$ is a weighted combination of these encoder hidden states. This language is useful because it disambiguates the attention mechanism from a specific choice of model and even from which variables in that model are being used for what.

Now that we understand this terminology, we can express ourselves more cleanly and abstractly. And with this terminology, it becomes clear that the queries, keys, and values need not be different objects in our model at all! In fact, queries, keys, and values can all be taken from the same set. For example, imagine we have a model with a hidden state $\mathbf{h}$. This is not necessarily the hidden state of an RNN or even a sequential model. We could define a kind of attention such that the queries ($\mathbf{q}$), keys ($\mathbf{k}$), and values ($\mathbf{v}$) are all functions of this hidden state:

$$\begin{aligned} \mathbf{q}_i &:= f_q(\mathbf{h}_i), \\ \mathbf{k}_j &:= f_k(\mathbf{h}_j), \\ \mathbf{v}_j &:= f_v(\mathbf{h}_j), \\ \alpha_{ij} &= \text{softmax}(\text{score}(\mathbf{q}_i, \mathbf{k}_j)), \qquad \sum_{j} \alpha_{ij} = 1, \\ \mathbf{c}_i &= \sum_j \alpha_{ij} \mathbf{v}_j. \end{aligned} \tag{39}$$

This is obviously different from the attention mechanism in Bahdanau. In Bahdanau, the authors use cross-attention, which is attention where the queries come from one set and the keys and values come from a different set. As you can imagine, typically the keys and values come from the same set, although they might have their own maps or projections such that they are correlated but not identical. For example, we might run a query against restaurants (keys) and also return restaurants (values).
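Before turning to self-attention, here is a minimal NumPy sketch of this generic query–key–value formulation (Equations 38 and 39): a single query is scored against a set of keys, the scores are normalized with a softmax, and the result is a weighted combination of the values. The function names, sizes, and the choice to pass the projection matrices in explicitly are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score(q, K, kind="dot", W_a=None, U_a=None, v_a=None):
    """Unnormalized alignment scores e_j = score(q, k_j), one per key (Equation 38)."""
    if kind == "dot":         # q^T k_j
        return K @ q
    if kind == "general":     # k_j^T W_a q
        return K @ (W_a @ q)
    if kind == "additive":    # v_a^T tanh(W_a k_j + U_a q)
        return np.tanh(K @ W_a.T + U_a @ q) @ v_a
    raise ValueError(f"unknown score function: {kind}")

def attend(q, K, V, **kwargs):
    """Equation 39: softmax the scores, then return a weighted sum of the values."""
    alpha = softmax(score(q, K, **kwargs))   # alignment weights over the keys
    return alpha @ V                         # context vector

# Toy example with made-up sizes: 4 keys/values of dimension 8, one query of dimension 8.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
c = attend(q, K, V, kind="dot")
```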
However, self-attention is when the queries, keys, and values all come from the same set of variables! To continue abusing our running example, we essentially compute the similarity between restaurants of interest and restaurants we have data about, and then use those weights to return a weighted combination of restaurants! To my knowledge, the first paper to use self-attention in NLP was Long short-term memory-networks for machine reading (Cheng et al., 2016). This model is a bit complicated, and I don’t think it’s that important to understand here. The key point is only to grok that attention does not have to be cross-attention as in Bahdanau. Instead, we can have a sequence attend to itself to decide what parts of the sequence matter—or self-attention! This is how this idea was described in the paper: A remaining practical bottleneck for RNNs is memory compression (Bahdanau et al., 2014): since the inputs are recursively combined into a single memory representation which is typically too small in terms of parameters, it becomes difficult to accurately memorize sequences (Zaremba & Sutskever, 2014). In the encoder-decoder architecture, this problem can be sidestepped with an attention mechanism which learns soft alignments between the decoding states and the encoded memories (Bahdanau et al., 2014). In our model, memory and attention are added within a sequence encoder allowing the network to uncover lexical relations between tokens. The important phrase here is “within a sequence encoder”. Here, the attention is not applied across the encoder and decoder but rather is applied as intra- or self-attention within the encoder. So circa 2017, attention was being studied in its many forms: local versus global, additive versus multiplicative, and cross versus self. And it was being more widely used in NLP, with papers like A structured self-attentive sentence embedding (Lin et al., 2017) and Bidirectional attention flow for machine comprehension (Seo et al., 2016). That said, I do not think any specific form was clearly the dominant one. Rather, each showed promise in its own way. For example, in March 2017, Google Brain published Massive exploration of neural machine translation architectures (Britz et al., 2017). This was published just months before the transformer would be published, and even here, attention is only a minor player. In that paper’s conclusions, the authors list six main results, and the only one about attention is a single sentence: Parameterized additive attention yielded the overall best results. Notice that additive attention is not even the form of attention used by the transformer! So at least as best as I understand it, attention was well-understood and widely-studied in 2017, but it was by no means considered the main ingredient or the next logical step. Many researchers were still pushing the limits of training RNNs at scale, rather than trying other approaches. See Exploring the limits of language modeling (Jozefowicz et al., 2016) for example. However, in June 2017, all that was about to change. The transformer’s time had come. The transformer In 2017, researchers at Google Brain published Attention is all you need (Vaswani et al., 2017), which is the original paper introducing the transformer architecture. This was their proposal, which I hope now makes sense given the context so far: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. 
The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The authors acknowledge that the sequence-to-sequence framework with neural networks was state-of-the-art, and they specifically call out the RNN encoder–decoder architecture with attention from Bahdanau, Luong, and others. Their proposal is simple: keep the encoder–decoder framework but replace everything else with attention. How might someone have come to this idea at the time? Why would it be a good idea to try? Their observation is that the sequential nature of RNNs inhibits training these models at scale: Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h t h_t ht​, as a function of the previous hidden state h t − 1 h_{t−1} ht−1​ and the input for position t t t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Their proposal is to use attention rather than RNNs to uncover dependencies within the input and output sequences. This is a good idea to try not because attention is obviously better than recurrence per se. It’s that attention is parallelizable! They write: The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. We have seen this before. Recall how the unlock for word embeddings in (Mikolov et al., 2013; Mikolov et al., 2013) was simplifying the models and focusing on scale. But then the RNN encoder–decoder architecture in (Bahdanau et al., 2014) with attention took us backwards in terms of model complexity. So the transformer is a similar story: take the best modeling ideas, strip them down, and train the simplified model at scale. That’s it. Properly understood in context, the transformer is a modest conceptual leap from the existing literature. My point is not that the transformer is “obvious” in the sense that it is not an impressive invention. My point is to demystify the research product by underscoring the process. In context, the transformer should make sense as something someone might have tried in 2017. The model architecture might look intimidating, but it is pretty straightforward when viewed in the right context (Figure 10 10 10). At a high level, the transformer is an encoder–decoder, with two big kinds of attention. First, we have cross-attention between the outputs of the encoder and the inputs to the decoder. This is completely analogous to the cross-attention in Bahdanau and others. But then we also have self-attention within the decoder and encoder. This completely replaces the recurrence relations of RNNs. Finally, the model uses something called positional encoding, which I’ll define shortly, to handle the fact that attention is not naturally sequential a la an RNN. Everything else is details. For example, the transformer also uses layer normalization (Ba et al., 2016) and residual connections (He et al., 2016), but these are not unique or novel contributions. Even multi-head attention is not conceptually hard. So understood in context, the transformer is pretty straightforward. 
Let’s go through the main bits in detail. First, positional encoding. A key challenge for the attention mechanism is that it does not inherently capture sequential structure. Thus, the relative positions of words in a sequence can be easily lost. In Vaswani, the authors propose attaching vectors of numbers to the inputs to capture this position-dependent information. The precise functional form of these numbers doesn’t really matter to us. The point is that we’re encoding the position of each word so that we can still model the sequential structure of natural language.

After adding position-dependent information, the transformer encodes the input sequence. But rather than passing the data through an RNN, it passes the data through multi-head attention layers. We’ll discuss “multi-head” in a moment, but the basic attention mechanism is what the authors call scaled dot-product attention. Let’s define it. Let $\mathbf{Q} \in \mathbb{R}^{M \times D_k}$ be a matrix of queries, let $\mathbf{K} \in \mathbb{R}^{N \times D_k}$ be a matrix of keys, and let $\mathbf{V} \in \mathbb{R}^{N \times D_v}$ be a matrix of values. Then scaled dot-product attention is:

$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{D_k}} \right) \mathbf{V}. \tag{40}$$

When I first read Vaswani, I had not yet read Bahdanau or Luong, and thus I was completely confused by Equation 40. It was not at all obvious what any of these values represented or why any of this machinery worked. And the paper itself gave a pretty opaque explanation:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Without context, this explanation is not very helpful. However, armed with a better understanding of attention, we can make sense of this. In the cross-attention between the encoder and decoder, the queries are analogous to the hidden states of the RNN decoder, while the keys and values are analogous to the hidden states of the RNN encoder. And if we consider a single query $\mathbf{q}_i$ at a time (a single row of $\mathbf{Q}$), we can rewrite Equation 40 in a way that looks like the types of attention in Equation 38:

$$\begin{aligned} e_{ij} &= \text{score}(\mathbf{q}_i, \mathbf{k}_j) = \frac{\mathbf{q}_i^{\top} \mathbf{k}_j}{\sqrt{D_k}}, \\ \alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{N} \exp(e_{ik})}, \\ \mathbf{c}_i &= \sum_{j=1}^{N} \alpha_{ij} \mathbf{v}_j. \end{aligned} \tag{41}$$

So this is identical to the multiplicative or dot-product attention proposed in Luong (Equation 38), modulo the scaling factor $\sqrt{D_k}$. In Equation 40, we are just packaging it into a matrix form so that we can compute this attention for all queries at once.
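As a quick sanity check on Equation 40, here is a minimal NumPy sketch of scaled dot-product attention in matrix form. The optional boolean mask anticipates the decoder’s masked self-attention discussed below; the shapes follow the definitions above, and everything else (sizes, random inputs) is invented for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Equation 40: softmax(Q K^T / sqrt(D_k)) V.

    Q    : (M, D_k) queries
    K    : (N, D_k) keys
    V    : (N, D_v) values
    mask : optional (M, N) boolean array; False entries are effectively
           excluded from the softmax (used for causal self-attention).
    """
    D_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D_k)                # (M, N) query-key similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # masked positions get ~zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                             # (M, D_v) one context vector per query

# Toy self-attention example: queries, keys, and values all come from
# the same (made-up) sequence of 6 vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
causal = np.tril(np.ones((6, 6), dtype=bool))      # position i may only attend to j <= i
out = scaled_dot_product_attention(X, X, X, mask=causal)
```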
In other words, this is a highly parallelizable version of dot-product attention.

I think one of the reasons the transformer can be confusing is the use of two types of attention and the generic language of queries, keys, and values, whose definitions change depending on the type of attention. In the encoder, the transformer uses self-attention. So the query represents the current vector in the input sequence, while the keys and values are all the vectors in the input sequence. And in the decoder, the query represents the current vector in the output sequence, while the keys and values are all the vectors in the output sequence (modulo masking, which I’ll mention in a moment). Finally, in the attention between the encoder and decoder (in the paper, Vaswani calls this “encoder–decoder attention”), the query is the current vector in the decoder output (analogous to $\mathbf{s}_i$ in the RNN encoder–decoder), while the keys and values are the encoder’s hidden outputs (analogous to $\mathcal{H}$ in the RNN encoder–decoder).

Note that “masked” in “masked multi-head self-attention” just refers to masking out words in the decoder’s self-attention mechanism. This is because attention has no inherent sequential structure a la RNNs. So we have to enforce this by masking future positions in the output sequence (Figure 8, right). This allows the transformer to be trained in the standard autoregressive framework we have discussed since (Bengio et al., 2003).

Finally, the transformer learns multiple sets of parameters associated with the attention mechanism at once. This is what the paper calls multi-head attention. Instead of having a single attention function, we can run multiple attention functions in parallel, say $A$ times. By way of analogy, recall that in the RNN encoder–decoder, we had the following attention parameters (Equation 35):

$$\{\mathbf{W}_a, \mathbf{U}_a, \mathbf{v}_a\}. \tag{42}$$

In Bahdanau (Equation 35), the subscript $a$ just denotes that these are attention-related weights. It is not actually indexing into multiple such weights (that is, $A=1$). But we could do that. We could say that $a$ indexes into different parameters, $a \in \{1, 2, \dots, A\}$. This would have made Bahdanau’s model slower to train, but it would have allowed for multiple cross-attention mechanisms to be learned at once. In Bahdanau, they don’t actually do this, likely because it’s too expensive! The precise details are different in Vaswani, but this is all multi-head attention is in theory. It is multiple parallel attention mechanisms.

So that’s it. That’s the transformer. The results were impressive. To be clear, it was not an AlexNet moment, but the results were clearly better than the benchmarks, and more importantly, the model was way more efficient. For example, one of the benchmarks in Vaswani is the ConvS2S Ensemble from Convolutional sequence to sequence learning (Gehring et al., 2017). The idea of this paper is similar to the transformer: train a bigger sequence-to-sequence model by eschewing recurrent connections in favor of parallelizable convolutional layers. In both English-to-German and English-to-French translation, the transformer beats this model in BLEU score. But more importantly, it is more efficient.
For example, according to Vaswani, the ConvS2S Ensemble required $1.2 \times 10^{21}$ flops to train their English-to-French model, whereas the transformer required $3.3 \times 10^{18}$ flops. So the transformer had comparable results with a roughly 360x reduction in flops! In my mind, this is the real insight. It is not that attention is absolutely the best way to model the problem. Rather, the transformer is on the Pareto frontier between modeling the problem well enough and being scalable enough. To see the transformer in code, see Sasha Rush’s excellent The annotated transformer.

Generative pre-training

The transformer was a revolutionary architecture, and explicitly designed to scale. However, in reality the original model was tiny by today’s standards. The biggest variant had only 213 million parameters, and the largest dataset it was trained on, the WMT 2014 English–French dataset, had only 36 million sentences. But the paper proved that the transformer worked well as a generic transduction model.

However, despite the paper’s name, the transformer architecture was not enough. Researchers also needed advancements in how these models were trained in order to make the commodity LLMs most people interact with today. To simplify the discussion, I’ll focus on training for OpenAI’s GPT series. My understanding is that OpenAI made a lot of the big contributions here, and so their papers are good landmarks to follow. Loosely, the three training stages they discuss in their GPT papers are generative pre-training, discriminative fine-tuning, and reinforcement learning with human feedback. Let’s work through the first two in detail here and the last one in detail in the next section.

In 2018, roughly a year after the transformer was published, OpenAI published Improving language understanding by generative pre-training (Radford et al., 2018). The main idea of the paper is to pre-train a transformer with as much unlabeled data as possible before fine-tuning it with task-specific supervised training. In the paper, the authors call the first step generative pre-training and the second step discriminative fine-tuning. (The words “generative” and “discriminative” have a long history in machine learning; see (Ng & Jordan, 2001) for a discussion.) As the OpenAI paper title suggests, the key focus was on generative pre-training. Supervised learning obviously matters, but the idea was that one could use unsupervised training at scale to build a base model and then use supervised learning to train more task-specific downstream models.

Let’s look at generative pre-training in a bit more detail. Since we do not have labels, we need some way to formalize the problem. In generative pre-training, the objective is next-word prediction as in the autoregressive framework. In other words, the objective is maximum likelihood estimation on Equation 1, in practice conditioning on a finite context window of the previous $N$ tokens:

$$L_{\textsf{GPT}}(\boldsymbol{\Theta}) = \sum_{t=1}^T \log p_{\boldsymbol{\Theta}}\left(w_t \mid w_{t-N:t-1}\right). \tag{43}$$

As we saw around Equation 12, maximum likelihood estimation here is equivalent to minimizing the cross-entropy loss between our model’s prediction of $w_t$ and the ground truth. So this whole process is unsupervised, and we can train our model on lots and lots and lots of data.
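To connect Equation 43 with the cross-entropy view, here is a tiny NumPy sketch that evaluates the average next-token negative log likelihood given a matrix of predicted probabilities. The model itself is abstracted away; the random “probabilities” and the vocabulary size are placeholders for illustration only.

```python
import numpy as np

def next_token_nll(probs, tokens):
    """Average negative log likelihood of a token sequence.

    probs  : (T, V) model probabilities over a V-word vocabulary,
             where row t is p(w_t | context)
    tokens : (T,)   the observed token ids w_1 ... w_T
    """
    T = len(tokens)
    log_likelihood = np.log(probs[np.arange(T), tokens])  # log p(w_t | context) at each step
    return -log_likelihood.mean()   # minimizing this is maximizing Equation 43

# Toy example: vocabulary of 10 tokens, sequence of length 4, random "model".
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=4)   # each row sums to 1
tokens = np.array([3, 1, 7, 2])
loss = next_token_nll(probs, tokens)
```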
It’s worth observing that Equation 43 is only one generative pre-training objective function, and it has limitations. In particular, note that the autoregressive framework means that the model is pre-trained “left to right” and thus limits the set of suitable downstream tasks. To address this limitation, in 2019, Google AI published BERT: Pre-training of deep bidirectional transformers for language understanding (Devlin et al., 2019). Here, the authors propose a pre-training objective that learns bidirectional representations. Rather than pre-training using the autoregressive framework, they pre-train using a “masked language model”, which randomly masks some of the tokens to predict, without assuming a left-to-right relationship. Quoting that paper:

Unlike left-to-right language model pre-training, the [masked language model] objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.

More formally, let $\mathcal{M} \subseteq \{1, 2, \dots, T\}$ be a mask denoting positions in the input sequence $w_{1:T}$, and let $\neg\mathcal{M}$ denote all indices that are not in $\mathcal{M}$. The denoising objective is to maximize

$$L_{\textsf{MLM}}(\boldsymbol{\Theta}) = \sum_{i \in \mathcal{M}} \log p_{\boldsymbol{\Theta}}\left(w_i \mid w_{\neg\mathcal{M}} \right). \tag{44}$$

This idea was inspired by the Cloze test (Taylor, 1953), and the idea was that this bidirectional transformer can then be fine-tuned on a much wider range of downstream tasks. That said, my understanding is that generative pre-training is fairly standard. The left-to-right assumption is simple and matches natural language, coding, and so forth. But I am not confident about what is used in absolutely state-of-the-art foundation models right now.

Either way, neither objective function is enough. For example, consider a conversational agent built on top of a large language model. Now imagine the user prompts an LLM with the following question: “I am having trouble getting a date. Any advice?” If the LLM is only trained on next-word prediction, a plausible response might be: “You’ll never find true love!” From the perspective of the distribution of English words on the internet, this is not an unreasonable response. But it is not helpful and hopefully not true. In other words, next-word prediction is obviously not enough for most meaningful tasks that leverage LLMs.

So the second step in training is discriminative fine-tuning. “Discriminative fine-tuning” is just a fancy way of saying supervised learning on specific tasks:

$$L_{\textsf{DFT}}(\boldsymbol{\theta}) = \sum_{(y,\, x_{1:T})} \log p_{\boldsymbol{\theta}}\left(y \mid x_{1:T} \right). \tag{45}$$

Here, I am using the standard notation for supervised learning, $(x, y)$, rather than the notation in this post. There are some possible subtleties here. For example, in the GPT-1 paper, they optimize a weighted objective function to balance between generative pre-training and discriminative fine-tuning:

$$L_{\textsf{final}} = L_{\textsf{DFT}} + \lambda \, L_{\textsf{GPT}}. \tag{46}$$

This ensures that during fine-tuning, the model does not unlearn parameters that are good for next-word prediction.
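To make the masked-language-model objective in Equation 44 a little more concrete, here is a simplified NumPy sketch that corrupts a toy token sequence and records the masked positions $\mathcal{M}$ at which the loss would be evaluated. The MASK_ID constant and the 15% masking fraction are illustrative assumptions; BERT’s actual corruption recipe has a few more details (for example, some selected tokens are replaced with random tokens or left unchanged rather than masked).

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 0   # hypothetical id of the [MASK] token in a toy vocabulary

def make_mlm_example(tokens, mask_frac=0.15):
    """Build one masked-language-model training example (a simplified sketch).

    tokens : (T,) token ids of the original sequence w_{1:T}
    Returns the corrupted input and the masked positions M, at which
    Equation 44 would evaluate log p(w_i | w_{not M}).
    """
    T = len(tokens)
    n_mask = max(1, int(mask_frac * T))
    masked_positions = rng.choice(T, size=n_mask, replace=False)   # the set M
    corrupted = tokens.copy()
    corrupted[masked_positions] = MASK_ID    # hide the tokens the model must predict
    return corrupted, masked_positions

tokens = np.array([5, 9, 2, 7, 4, 8, 3, 6])
corrupted, M = make_mlm_example(tokens)
# The MLM loss would sum log p(tokens[i] | corrupted) over positions i in M only.
```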
In the process of trying to fine-tune LLMs, researchers have built ever more task-specific datasets to tackle problems like question-and-answering (Reddy et al., 2019), text summarization (Nallapati et al., 2016), commonsense inference (Zellers et al., 2019), code generation (Chen et al., 2021), broader discourse context (Paperno et al., 2016), and grade school math (Cobbe et al., 2021). A pre-trained LLM can be fine-tuned in a dizzying number of ways. I have two caveats to the above presentation. First, I want to emphasize that this two-step training procedure was not a conceptual leap for researchers. At the time, researchers were already training models with pre-trained word embeddings, and even before this, this two-step training procedure was both understood and used in practice. For examples, see (Collobert & Weston, 2008; Ramachandran et al., 2016; Hinton et al., 2012). Furthermore, researchers knew to use both pre-trained word embeddings and to even have task-specific objectives when training their word embeddings. Remember ELMO? The earliest reference I have found to this idea of pre-training—I am sure there are earlier ones—is from the 2006 paper Greedy layer-wise training of deep networks (Bengio et al., 2006). Here, the authors write: We hypothesize that three aspects of this strategy are particularly important: first, pre-training one layer at a time in a greedy way; second, using unsupervised learning at each layer in order to preserve information from the input; and finally, fine-tuning the whole network with respect to the ultimate criterion of interest. In these examples above, it’s clear the authors recognize that one can pre-train a model with unsupervised learning and then fine-tune it with supervised learning. So even in the GPT paper, the novel contribution is not generative pre-training per se, but only applying it to language modeling at an unprecedented scale. My second caveat is that while discriminative fine-tuning is used in commodity LLMs that many people interact with, the early GPT models were remarkable in part because they did not need fine-tuning! For example, as their titles suggest, the GPT-2 paper Language models are unsupervised multitask learners (Radford et al., 2019) and the GPT-3 paper Language models are few-shot learners (Brown et al., 2020) both focus on massively pre-trained transformers that excel in the zero-shot (Palatucci et al., 2009) and few-shot settings, on a variety of tasks like reading comprehension, summarization, and translation. For example, in the GPT-3 paper, the authors are explicit: For all tasks, GPT-3 is applied without any gradient updates or fine-tuning. That said, many related research projects did fine-tune these models, and the GPT-4 technical report (Achiam et al., 2023) does discuss post-training alignment, which we’ll discuss next. So while each LLM may be trained in slightly different ways, I am fairly confident most foundation models today are trained with some combination of massive pre-training and then optionally task-specific fine-tuning and alignment. I’m sure the precise details vary depending on the final product. For example, OpenAI’s Codex is a version of GPT-5 but optimized for agentic coding. Alignment Making LLMs bigger does not necessarily make them better at following a user’s intent or make them more aligned with human values. For example, we might not want conversational agents to lie, to make racist jokes, or to sexually harass the user. 
But nothing in the autoregressive framework accounts for this. We need to somehow encode these human values into the model. For some of these properties, we might be able to use a form of fine-tuning. There are datasets for this, such as the ETHICS dataset (Hendrycks et al., 2020) or the RealToxicityPrompts dataset (Gehman et al., 2020). But the limitations here are fairly obvious. And for many human values, it would be difficult to encode because the property itself is hard to define. To encode these properties, state-of-the-art LLMs are often trained using something called reinforcement learning with human feedback (RLHF). RLHF was developed around the same time as the transformer, in Deep reinforcement learning from human preferences (Christiano et al., 2017). The original motivation was how to expand the reinforcement learning (RL) framework beyond problems with well-specified reward functions. For example, RL has been used to great effect to play Go (Silver et al., 2016), Atari (Mnih et al., 2013), and Dota 2 (Berner et al., 2019), but what these tasks have in common is that their reward functions are relatively simple and their environments are relatively easy to simulate. But to borrow two examples from Christiano et al, how would you teach a machine-learning model to clean a table or to scramble an egg? It’s hard to come up with an objective function or simulation environment for these kinds of tasks. What we need, then, is a reward function that can be defined by human feedback and thus by human preferences. Broadly, RLHF is a three-step training procedure (Figure 11 11 11). First, humans are used to label a dataset which captures human preferences. For example, if the task is text summarization, the dataset might be different candidate summarizations, with the best summarization being defined by human scorers. Second, researchers train a reward function on these data, which predicts which output the humans would prefer. Finally, given this reward function, researchers can apply standard RL algorithms such as proximal policy optimization or PPO (Schulman et al., 2017) to fine-tune the model. Fine-tuning LLMs with RLHF is now fairly standard practice. For example, GPT-2 was fine-tuned this way in Fine-tuning language models from human preferences (Ziegler et al., 2019), while GPT-3 was fine-tuned this way in Training language models to follow instructions with human feedback (Ouyang et al., 2022) and in Learning to summarize with human feedback (Stiennon et al., 2020). And the GPT-4 whitepaper (Achiam et al., 2023) s