The Annotated Transformer
v2022: Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak, and Stella Biderman.
Original: Sasha Rush.
The Transformer has been on a lot of people’s minds over the last five years. This post presents an annotated version of the paper in the form of a line-by-line implementation. It reorders and deletes some sections from the original paper and adds comments throughout. This document is itself a working notebook and should be a completely usable implementation. Code is available here.
Prelims
# !pip install -r requirements.txt
# # Uncomment for colab
# #
# !pip install -q torchdata==0.3.0 torchtext==0.12 spacy==3.2 altair GPUtil
# !python -m spacy download de_core_news_sm
# !python -m spacy download en_core_web_sm
import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax, pad
import math
import copy
import time
from torch.optim.lr_scheduler import LambdaLR
import pandas as pd
import altair as alt
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
import spacy
import GPUtil
import warnings
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

warnings.filterwarnings("ignore")

# Set to False to skip notebook execution (e.g. for debugging)
RUN_EXAMPLES = True
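The RUN_EXAMPLES flag above only has an effect if example cells check it before running. The following is a minimal sketch of one way such a flag could gate execution. The helper name run_example, its enabled parameter, and the demo_add function are hypothetical and introduced purely for illustration. They are not taken from the code shown above.

```python
# Hypothetical sketch: gate example cells behind a module-level flag.
# `run_example`, `enabled`, and `demo_add` are illustrative names only.
RUN_EXAMPLES = True  # mirrors the flag defined in the setup cell


def run_example(fn, *args, enabled=None):
    """Call fn(*args) only when examples are enabled; otherwise return None."""
    if enabled is None:
        # Read the flag at call time so flipping it later takes effect.
        enabled = RUN_EXAMPLES
    if not enabled:
        return None
    return fn(*args)


def demo_add(a, b):
    # Stand-in for an expensive example cell (training, plotting, ...).
    return a + b


result_on = run_example(demo_add, 2, 3)                  # example runs
result_off = run_example(demo_add, 2, 3, enabled=False)  # example is skipped
```

Reading the flag inside the helper rather than at import time means a single assignment to RUN_EXAMPLES is enough to switch every gated example on or off.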