
Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer

Why This Matters

This project demonstrates that a simple, single-layer transformer can be trained on vintage hardware such as the PDP-11, highlighting how minimal the requirements for neural network training really are. It makes a case for accessible, low-resource AI experimentation and offers insight into the fundamental workings of transformers for researchers and enthusiasts alike.

Key Takeaways

ATTN/11 - Paper Tape Is All You Need

A single-layer, single-head transformer written in PDP-11 assembly language.

This project is the spiritual successor to Xortran, a neural network that learns XOR via backpropagation, written in Fortran IV for the IBM 1130 (1965) and the PDP-11/20 (1970).

The natural next step was to see if those machines could successfully train a small transformer in an acceptable amount of time (a few hours).

Architecturally, a transformer is actually a fairly modest extension of a basic neural network. The building blocks such as matrix multiplies, backpropagation, SGD, and cross-entropy are already there.
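Two of those existing building blocks, cross-entropy loss and an SGD update, fit in a few lines. As a rough illustration (in Python rather than PDP-11 assembly; the function names and learning rate are illustrative, not from the project):

```python
import math

def cross_entropy(probs, target):
    # loss is the negative log-probability assigned to the correct class
    return -math.log(probs[target])

def sgd_step(weights, grads, lr=0.1):
    # plain stochastic gradient descent: w <- w - lr * dL/dw
    return [w - lr * g for w, g in zip(weights, grads)]

# a confident, correct prediction yields a low loss
loss = cross_entropy([0.1, 0.1, 0.8], target=2)
updated = sgd_step([1.0, -0.5], [2.0, -1.0])
```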

The three new components are:

Self-attention: dot-product score between projected queries and keys

Positional encoding: learned position embeddings, added to the input

Softmax: to turn scores into a probability distribution
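The three components above compose into a single forward pass: add position embeddings to the token embeddings, project to queries/keys/values, score queries against keys, and softmax the scores into attention weights. A minimal sketch in plain Python (not the project's assembly; all shapes, values, and projection matrices here are illustrative):

```python
import math

def matmul(A, B):
    # naive matrix multiply: (n x k) @ (k x m) -> (n x m)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    # numerically stable softmax: one row of scores -> probability distribution
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    # single-head scaled dot-product attention over a sequence X (seq_len x d)
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Wq[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d) for kr in K]
              for qr in Q]
    weights = [softmax(r) for r in scores]  # each row sums to 1
    return matmul(weights, V), weights

# toy input: 3 tokens, model width 2
tok = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # token embeddings
pos = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]   # learned position embeddings
X = [[t + p for t, p in zip(tr, pr)] for tr, pr in zip(tok, pos)]

I = [[1.0, 0.0], [0.0, 1.0]]  # identity projections, for illustration only
out, weights = self_attention(X, I, I, I)
```

Each row of `weights` is the probability distribution the softmax produces over input positions; the output is the attention-weighted mix of the value vectors.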

The goal is to train the transformer to reverse a sequence of digits. Despite its apparent simplicity, reversal is not a trivial task for a neural network: the model must learn to route each token to a position that depends only on its index, with no content-based shortcut. This is exactly the kind of problem self-attention is designed for, and it is in fact one of the algorithmic benchmarks included in Tensor2Tensor, Google's 2017 reference implementation of the original transformer.
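The "route by index" structure of the task can be made concrete: the ideal attention pattern for reversal is a content-independent permutation in which output position i attends entirely to input position n-1-i. A small sketch of the toy data and that target pattern (hypothetical helper names, not from the project):

```python
import random

def make_example(n=8, seed=0):
    # one toy training pair for the reversal task: digits and their reverse
    rng = random.Random(seed)
    seq = [rng.randrange(10) for _ in range(n)]
    return seq, seq[::-1]

def reversal_attention(n):
    # the content-independent pattern the model must learn: output
    # position i puts all of its attention on input position n-1-i
    return [[1.0 if j == n - 1 - i else 0.0 for j in range(n)]
            for i in range(n)]

seq, target = make_example()
A = reversal_attention(len(seq))
# applying the one-hot anti-diagonal weights routes each token to
# its mirrored position, reproducing the reversed sequence
routed = [sum(w * x for w, x in zip(row, seq)) for row in A]
```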
