Tech News

Apple taught an LLM to predict tokens up to 5x faster in math and coding tasks


A new research paper from Apple details a technique that speeds up large language model responses while preserving output quality. Here are the details.

The nerdy bits

Traditionally, LLMs generate text one token at a time. This is slow because each step depends on all the previous ones to keep the output coherent and accurate.

If the model is writing a sentence like “The cat is black”, it predicts each token in sequence. After writing “The cat is”, it looks at everything so far (plus the user’s request, and patterns it learned during training) to calculate the probability of every possible next token in its vocabulary. That’s called autoregression.

In this scenario, it might rank options like black, tall, sleeping, grumpy, fluffy, skinny, purring, white, tired, playing, missing, meowing, cold, and so on, then choose the one that best fits the context.
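That rank-and-pick loop is easy to see in code. Here’s a minimal Python sketch of greedy autoregressive decoding; the toy next_token_probs function, its tiny vocabulary, and the hard-coded probabilities are made up for this example and simply stand in for a real model’s forward pass.

```python
# Minimal sketch of one-token-at-a-time (autoregressive) decoding.
# next_token_probs is a stand-in for a real LLM forward pass, with
# illustrative numbers -- none of this is Apple's code.

VOCAB = ["black", "tall", "sleeping", "grumpy", "fluffy", "white", "<eos>"]

def next_token_probs(context):
    """Assigns a probability to every token in the vocabulary,
    conditioned on everything written so far."""
    if context[-1] == "is":
        return {"black": 0.40, "fluffy": 0.20, "sleeping": 0.15,
                "grumpy": 0.10, "tall": 0.08, "white": 0.07, "<eos>": 0.0}
    # Toy fallback: once the sentence is done, predict end-of-sequence.
    return {tok: (1.0 if tok == "<eos>" else 0.0) for tok in VOCAB}

def generate(prompt, max_new_tokens=5):
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)  # each step sees ALL previous tokens
        best = max(probs, key=probs.get)   # greedy: take the top-ranked token
        if best == "<eos>":
            break
        context.append(best)               # the new token feeds the next step
    return context

print(" ".join(generate(["The", "cat", "is"])))  # -> The cat is black
```

The loop is the bottleneck: every new token requires a full pass over the model, and no pass can start before the previous one finishes.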

What Apple did

In the study “Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential,” Apple’s team found that even though these models are usually trained to predict just the next token, they still carry useful information about several upcoming tokens.

Building on that, they developed a “multi-token prediction” (MTP) framework that lets the model produce multiple tokens at once.

If this sounds a bit like the diffusion model study we covered a few weeks ago, you’re not far off. While the training process and the underlying technologies differ, both approaches aim to speed up inference and reach a result faster than the one-token-at-a-time approach.

In this particular study, the researchers inserted special “mask” tokens into prompts, which are basically placeholders for upcoming words.
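To give a flavor of how filling in placeholders can speed things up, here’s a highly simplified Python sketch of a mask-and-verify scheme: append several mask tokens, let one pass propose a token for each, then keep only the prefix that ordinary one-step prediction agrees with. Every helper here is a hypothetical stand-in with canned answers, and the verification step is my assumption (in the spirit of speculative decoding) about how output quality could be preserved; Apple’s actual framework is considerably more involved.

```python
# Simplified sketch of mask-based multi-token prediction -- NOT Apple's
# implementation. model_fill_masks and model_next_token are hypothetical
# stand-ins for real model calls, with canned answers for this example.

MASK = "<mask>"
EOS = "<eos>"

def model_fill_masks(masked_context):
    """Hypothetical: one forward pass over a context whose tail is MASK
    placeholders, proposing a token for every mask position at once."""
    k = masked_context.count(MASK)
    last_real = masked_context[-k - 1]
    canned = {"is": ["black", "and", "fluffy"]}
    return canned.get(last_real, [EOS] * k)[:k]

def model_next_token(context):
    """Hypothetical: ordinary one-step autoregressive prediction, used to
    check whether each speculated token should be kept."""
    answers = {"is": "black", "black": "and", "and": "fluffy"}
    return answers.get(context[-1], EOS)

def mtp_generate(prompt, k=3, max_new_tokens=8):
    context = list(prompt)
    while len(context) - len(prompt) < max_new_tokens:
        draft = model_fill_masks(context + [MASK] * k)  # k tokens, ONE pass
        accepted = 0
        for tok in draft:
            # Keep a speculated token only if standard one-step decoding
            # would have produced the same thing.
            if tok == EOS or model_next_token(context) != tok:
                break
            context.append(tok)
            accepted += 1
        if accepted == 0:
            tok = model_next_token(context)  # fall back to plain decoding
            if tok == EOS:
                break
            context.append(tok)
    return context

print(" ".join(mtp_generate(["The", "cat", "is"])))
# -> The cat is black and fluffy
```

In the best case, each pass commits several tokens instead of one; in the worst case, the scheme degrades gracefully to ordinary single-token decoding, since a drafted token only survives if the standard prediction would have picked it anyway.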
