
NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute

Why This Matters

NanoGPT Slowrun demonstrates a groundbreaking 10x increase in data efficiency, enabling smaller models to achieve comparable performance with significantly less data. This advancement shifts the focus from data-intensive training to compute-driven scaling, potentially transforming how large language models are developed and deployed. The innovative use of ensembling further enhances model performance without increasing data requirements, highlighting new avenues for efficient AI training.

Key Takeaways

We've achieved 10x data efficiency with NanoGPT Slowrun within a few weeks. An ensemble of 1.8B-parameter models (18B total params) trained on 100M tokens matches what would normally require 1B tokens with a standard LM baseline. Data efficiency matters because compute grows much faster than data. Since our current scaling laws require proportional increases in both, intelligence will eventually be bottlenecked by data, not compute. This data efficiency result allows us to improve model performance by scaling with compute rather than with data.

[Figure: NanoGPT Slowrun, 3.8x data efficiency]

A few things worth noting. First, this looks nothing like our current scaling laws. Chinchilla says you should train a ~5M parameter model if you have 100M tokens -- a staggering 3600x difference from what we're doing. Second, 10x data efficiency would've seemed unimaginable to most people, and we got there in ... a few weeks. Here's how. Some of the gains come from architectural tweaks without much principle behind them. But a few are principled, and we believe they will transfer to larger scales. Those are what matter fundamentally.

Ensemble

Ensembling is probably the most understudied axis of scaling in pretraining. Instead of training one model, you train many models somewhat independently and aggregate their predictions at inference. This way, you can keep leveraging more compute under fixed data and keep improving generalization.
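The aggregation step is simple in principle: each member produces a logit vector for the next token, and the ensemble averages them before taking a softmax. A minimal NumPy sketch (the function names and toy logits here are illustrative, not from the Slowrun codebase):

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ensemble_logits(member_logits):
    """Average raw logits across K ensemble members.

    member_logits: array of shape (K, vocab), one logit vector per model.
    Returns a single (vocab,) logit vector for the ensemble prediction.
    """
    return np.mean(member_logits, axis=0)

# Toy example: 3 hypothetical members over a 4-token vocabulary.
logits = np.array([
    [2.0, 0.5, -1.0, 0.0],
    [1.5, 1.0, -0.5, 0.2],
    [2.2, 0.3, -1.2, 0.1],
])
probs = np.exp(log_softmax(ensemble_logits(logits)))
```

Averaging logits (rather than probabilities) is one common choice; the key property is that inference-time compute now scales with K while the training data stays fixed.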

Training dynamics for ensembles are very different from those of a single model. This is a key insight. Pandey et al. show that post-hoc transforms like ensembling reverse the usual overfitting dynamics: while base models overfit with more training, ensembles favor base models trained for more epochs. Kim et al. independently find that ensembling allows for much longer training than a single model.

We see exactly this. In PR #26, we extended training from 12 to 18 epochs. Individual model loss went from 3.295 to 3.310 -- it got worse. But ensemble loss dropped from 3.185 to 3.166. The models learn different things when you push them past their individual optimum, and that helps the ensemble.

Chain distillation. We've found that chain knowledge distillation dramatically improves ensemble training (PR #31). The idea, inspired by Born-Again Neural Networks, is to train models sequentially, where each new model distills from the immediately preceding one:

Algorithm: Chain Distillation Ensemble
1. Train model M_1 on data D with standard cross-entropy loss.
2. For k = 2, ..., K:
   a. Load M_{k-1} as a frozen teacher.
   b. Train model M_k from scratch on D with loss:
      L = (1 - α) · CE(M_k(x), y) + α · T² · KL(M_k(x)/T ‖ M_{k-1}(x)/T),
      where α = 0.5, T = 1.0.
   c. Discard the teacher from memory.
3. At inference, ensemble all K models by averaging logits.
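The per-token loss in step 2b can be sketched in a few lines of NumPy. This follows the listing as written -- including its KL(student/T ‖ teacher/T) direction -- and is an illustrative sketch, not the repo's training code:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def chain_distill_loss(student_logits, teacher_logits, target, alpha=0.5, T=1.0):
    """Per-token chain-distillation loss from the algorithm above.

    Combines cross-entropy against the hard label with a temperature-scaled
    KL term against the frozen teacher: (1-α)·CE + α·T²·KL(student/T ‖ teacher/T).
    """
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    ce = -np.log(softmax(student_logits)[target])    # hard-label cross-entropy
    kl = np.sum(p_s * (np.log(p_s) - np.log(p_t)))   # KL(student || teacher)
    return (1 - alpha) * ce + alpha * T**2 * kl

loss = chain_distill_loss([2.0, 0.1, -1.0], [1.8, 0.3, -0.9], target=0)
```

When student and teacher logits agree, the KL term vanishes and the loss reduces to half the plain cross-entropy, so the teacher only exerts pressure where the two models disagree.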

Note that only the immediately preceding model serves as teacher, not the full ensemble of prior models. This keeps memory constant and training fast. With 8 models trained this way in the chain distillation PR, individual model loss plateaus around 3.20, but ensemble loss hits 3.126 -- taking us from 7x to 8x data efficiency.
