Arithmetic Without Numbers – How LLMs Do Math

If you learned arithmetic the ordinary human way, you probably learned it with a body. You counted on fingers. You grouped things into piles. You lined digits into columns. You carried a one. Later, perhaps, you used an abacus, graph paper, or a calculator.

A language model has none of that. It has matrices. Tokens enter, activations flow, logits come out. And yet, if you ask a modern language model for a greatest common divisor, a multiplication, or a division with remainder, something inside that matrix-only body responds.

Working vocabulary

Token: one unit the model reads or prints. A token might be a word, part of a word, punctuation, or a chunk of digits.

Vector: a list of numbers. A model stores each token's current state as a vector with many dimensions.

Activation: the model's temporary internal state while it is processing a token.

Readout: a small external model trained to recover a fact from an activation, such as the operation or an operand.

Logit: a raw score for a possible next token. Higher logit means the model is more likely to print that token.

Layer: one repeated processing step in the transformer. A modern model has many layers, each updating the running state.

Residual stream: the main running vector passed from layer to layer, like a shared scratchpad without named variables.

Attention: the part of a layer that lets one token position look at information from other positions.

MLP / feed-forward block: the part of each transformer layer that transforms one token position's vector by itself. Attention lets positions exchange information; the MLP then reshapes the local vector, often strengthening, suppressing, or recombining features already present there.

Next-token prediction: the training and generation rule for ordinary language models. Given the text so far, the model scores possible next tokens, prints one, then repeats the process.

Phase: position around a repeating cycle, like the angle of a hand on a clock. Helix-style number codes use phase-like geometry.

GCD / LCM: greatest common divisor and least common multiple. For example, gcd(84, 36) = 12.
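Three of these terms are easier to hold onto with a few lines of code. The sketch below is only an illustration of the vocabulary, not a claim about what any model computes internally: the candidate tokens and logit values are made up, `phase` is a deliberately simplified single-circle stand-in for the richer helix-style codes, and `gcd` is the ordinary textbook Euclidean algorithm, shown as a reference point for the answers a model is asked to produce.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def phase(n, period):
    """Place n on a cycle of length `period` as a point on the unit circle.

    Simplifying assumption: one circle only. Numbers that differ by a
    multiple of `period` land on the same point, like hours on a clock.
    """
    angle = 2 * math.pi * (n % period) / period
    return (math.cos(angle), math.sin(angle))

def gcd(a, b):
    """Euclid's algorithm: repeated division with remainder."""
    while b:
        a, b = b, a % b
    return a

def lcm(a, b):
    """Least common multiple via the gcd (assumes a and b are positive)."""
    return a * b // gcd(a, b)

# Hypothetical logits for three candidate next tokens (invented values).
candidates = ["12", "6", "18"]
logits = [4.0, 1.5, 0.5]
probs = softmax(logits)
best = candidates[probs.index(max(probs))]  # greedy pick: highest probability

print(best)            # the token with the highest logit
print(gcd(84, 36))     # 12, matching the glossary example
print(lcm(84, 36))     # 252
print(phase(3, 10) == phase(13, 10))  # True: same position on a 10-cycle
```

The greedy pick at the end is one simple way to choose a token; real generation often samples from the probabilities instead of always taking the top one.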

Rune began with the debugging question behind the jargon: when a language model gives an arithmetic answer, is it recalling a pattern, running something like an algorithm, or merely producing a plausible next token?