The Annotated Transformer

v2022: Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak, and Stella Biderman.

Original: Sasha Rush.

The Transformer has been on a lot of people’s minds over the last year five years. This post presents an annotated version of the paper in the form of a line-by-line implementation. It reorders and deletes some sections from the original paper and adds comments throughout. This document itself is a working notebook, and should be a completely usable implementation. Code is available here.

Table of Contents

Prelims

Skip

# !pip install -r requirements.txt

# # Uncomment for colab # # # !pip install -q torchdata==0.3.0 torchtext==0.12 spacy==3.2 altair GPUtil # !python -m spacy download de_core_news_sm # !python -m spacy download en_core_web_sm

import os from os.path import exists import torch import torch.nn as nn from torch.nn.functional import log_softmax, pad import math import copy import time from torch.optim.lr_scheduler import LambdaLR import pandas as pd import altair as alt from torchtext.data.functional import to_map_style_dataset from torch.utils.data import DataLoader from torchtext.vocab import build_vocab_from_iterator import torchtext.datasets as datasets import spacy import GPUtil import warnings from torch.utils.data.distributed import DistributedSampler import torch.distributed as dist import torch.multiprocessing as mp from torch.nn.parallel import DistributedDataParallel as DDP # Set to False to skip notebook execution (e.g. for debugging) warnings.filterwarnings("ignore") RUN_EXAMPLES = True

... continue reading