What’s the strongest model I can train on my MacBook Pro in five minutes?
I’ll give the answer upfront: the best 5-minute model I could train was a ~1.8M-param GPT-style transformer trained on ~20M TinyStories tokens, reaching ~9.6 perplexity on a held-out split. Here’s an example of the output, with the prompt bolded:
Once upon a time, there was a little boy named Tim. Tim had a small box that he liked to play with. He would push the box to open. One day, he found a big red ball in his yard. Tim was so happy. He picked it up and showed it to his friend, Jane. “Look at my bag! I need it!” she said. They played with the ball all day and had a great time.
OK, so it’s not great. But it’s not bad for five minutes!
The challenge
I’ve been interested in this silly question for a few days. It’s a silly question for two reasons. First, anyone who can afford a MacBook can afford to rent half an hour on an H100 and train a model that’s several orders of magnitude more powerful. Second, if you were forced to train on a weaker device like a laptop, there’s no reason to limit yourself to five minutes (and no reason to think it would even be possible to train a strong model in that time).
Other training challenges like BabyLM restrict the training data, which makes sense: some domains have very little data, so it’s useful to know how to train a model most effectively when data is scarce. It’s also a popular research goal to train the smallest strong model, which also makes sense, since small models can run on phones and other portable devices. But I can use as much training data as I want, and as large a model as I want. My main limitation is time.
In five minutes, you just can’t push that many tokens through a model. That means large models are out of the question, since a larger model takes longer to train per token. Better to train a 1M param model on 4M tokens than a 1B param model on 4,000 tokens. But of course you can’t go too small, either. In five minutes I can move a lot of tokens through a tiny 10k param model, but that model is just not large enough to learn English grammar: the training loss plateaus in the first thirty seconds and doesn’t move after that, and the model just outputs gibberish.
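To put rough numbers on that tradeoff: the total token budget is simply throughput multiplied by wall-clock time. Here’s a quick sketch of that arithmetic, using the ~3,000 tokens/second of my first unoptimized run (next section) and the roughly ~67,000 tokens/second implied by pushing ~20M tokens through the final model in five minutes:

```python
# Back-of-the-envelope token budget: tokens trained = throughput * wall-clock time.
TRAIN_SECONDS = 5 * 60  # the five-minute budget

def token_budget(tokens_per_second: float) -> int:
    """Total tokens you can push through a model in the time budget."""
    return int(tokens_per_second * TRAIN_SECONDS)

print(token_budget(3_000))   # 900,000    -- my first unoptimized run
print(token_budget(67_000))  # 20,100,000 -- roughly the final setup
```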
Pushing more tokens-per-second
My first goal was to figure out which performance optimizations would actually help at this tiny scale. My first textbook GPT-2-style transformer trained at ~3000 tokens per second (using Apple’s MPS). Interestingly, the math-based optimizations either slowed me down or had no meaningful effect: using torch.compile on the model, switching to float16, and so on.
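For context, the throughput measurement looks roughly like this. This is a minimal sketch rather than my actual training script: the model is a stand-in built from PyTorch’s TransformerEncoder instead of the real GPT-2-style architecture, and the sizes are placeholders. The commented-out lines show where torch.compile and float16 would slot in.

```python
import time
import torch
import torch.nn as nn

# Use Apple's MPS backend when available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Placeholder sizes -- not the real experiment's hyperparameters.
vocab_size, d_model, seq_len, batch_size = 8192, 128, 256, 32

# Stand-in model: embedding -> small transformer encoder -> vocab projection.
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=512, batch_first=True),
        num_layers=4,
    ),
    nn.Linear(d_model, vocab_size),
).to(device)

opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# The "math-based" optimizations that didn't pay off at this scale:
# model = torch.compile(model)  # no meaningful effect (or slower) for me
# model = model.half()          # float16: same story

tokens_seen, start = 0, time.time()
for step in range(50):
    # Random token batches are fine for a pure throughput measurement.
    x = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
    y = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, vocab_size), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    tokens_seen += batch_size * seq_len

print(f"~{tokens_seen / (time.time() - start):,.0f} tokens/sec")
```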