
Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer

Why This Matters

This project demonstrates that a simple, single-layer transformer can be trained on vintage hardware such as the PDP-11, highlighting how minimal the requirements for neural network training really are. It makes a case for accessible, low-resource AI experimentation and offers insight into the fundamental workings of transformers for researchers and enthusiasts alike.

Key Takeaways

ATTN/11 - Paper Tape Is All You Need

A single-layer, single-head transformer written in PDP-11 assembly language.

This project is the spiritual successor to Xortran, a neural network that learns XOR via backpropagation, written in Fortran IV for the IBM 1130 (1965) and the PDP-11/20 (1970).

The natural next step was to see if those machines could successfully train a small transformer in an acceptable amount of time (a few hours).

Architecturally, a transformer is actually a fairly modest extension of a basic neural network. The building blocks such as matrix multiplies, backpropagation, SGD, and cross-entropy are already there.
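Two of those existing building blocks, cross-entropy loss and an SGD update, fit in a few lines. As a rough illustration (in Python rather than PDP-11 assembly; the function names and learning rate are illustrative, not from the project):

```python
import math

def cross_entropy(probs, target):
    # loss is the negative log-probability assigned to the correct class
    return -math.log(probs[target])

def sgd_step(weights, grads, lr=0.1):
    # plain stochastic gradient descent: w <- w - lr * dL/dw
    return [w - lr * g for w, g in zip(weights, grads)]

# a confident, correct prediction yields a low loss
loss = cross_entropy([0.1, 0.1, 0.8], target=2)
updated = sgd_step([1.0, -0.5], [2.0, -1.0])
```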

The three new components are:

Self-attention: dot-product score between projected queries and keys

Positional encoding: learned position embeddings, added to the input

Softmax: to turn scores into a probability distribution
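The three components above compose into a single forward pass: add position embeddings to the token embeddings, project to queries/keys/values, score queries against keys, and softmax the scores into attention weights. A minimal sketch in plain Python (not the project's assembly; all shapes, values, and projection matrices here are illustrative):

```python
import math

def matmul(A, B):
    # naive matrix multiply: (n x k) @ (k x m) -> (n x m)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    # numerically stable softmax: one row of scores -> probability distribution
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    # single-head scaled dot-product attention over a sequence X (seq_len x d)
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Wq[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d) for kr in K]
              for qr in Q]
    weights = [softmax(r) for r in scores]  # each row sums to 1
    return matmul(weights, V), weights

# toy input: 3 tokens, model width 2
tok = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # token embeddings
pos = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]   # learned position embeddings
X = [[t + p for t, p in zip(tr, pr)] for tr, pr in zip(tok, pos)]

I = [[1.0, 0.0], [0.0, 1.0]]  # identity projections, for illustration only
out, weights = self_attention(X, I, I, I)
```

Each row of `weights` is the probability distribution the softmax produces over input positions; the output is the attention-weighted mix of the value vectors.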

The goal is to train the transformer to reverse a sequence of digits. Despite its apparent simplicity, reversal is not a trivial task for a neural network: the model must learn to route each token to a position that depends only on its index, with no content-based shortcut. This is exactly the kind of problem self-attention is designed for, and it is in fact one of the algorithmic benchmarks included in Tensor2Tensor, Google's 2017 reference implementation of the original transformer.
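The "route by index" structure of the task can be made concrete: the ideal attention pattern for reversal is a content-independent permutation in which output position i attends entirely to input position n-1-i. A small sketch of the toy data and that target pattern (hypothetical helper names, not from the project):

```python
import random

def make_example(n=8, seed=0):
    # one toy training pair for the reversal task: digits and their reverse
    rng = random.Random(seed)
    seq = [rng.randrange(10) for _ in range(n)]
    return seq, seq[::-1]

def reversal_attention(n):
    # the content-independent pattern the model must learn: output
    # position i puts all of its attention on input position n-1-i
    return [[1.0 if j == n - 1 - i else 0.0 for j in range(n)]
            for i in range(n)]

seq, target = make_example()
A = reversal_attention(len(seq))
# applying the one-hot anti-diagonal weights routes each token to
# its mirrored position, reproducing the reversed sequence
routed = [sum(w * x for w, x in zip(row, seq)) for row in A]
```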
