In a new study, Apple researchers present a diffusion model that can write up to 128 times faster than its counterparts. Here's how it works.

The nerdy bits

Here's what you need to know for this study: LLMs such as ChatGPT are autoregressive models. They generate text sequentially, one token at a time, taking into account both the user's prompt and all previously generated tokens.

Diffusion models take a different approach: they generate multiple tokens in parallel and refine them over several iterative steps until the full response takes shape.

Finally, flow-matching models are a variant of diffusion models that basically skip the iterative refinement and learn to generate the final result in one go.

For a deeper dive into how diffusion models work, check out this post on Apple's diffusion-based coding model. And to learn more about flow-matching models, check out this post on Apple's flow-matching model for protein folding.

Apple's new study

In a study published today, titled "FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models," researchers from Apple and Ohio State University propose a new model called Few-Step Discrete Flow-Matching, or FS-DFM.

In the study, the researchers demonstrate that FS-DFM was able to write full-length passages with just eight quick refinement rounds, matching the quality of diffusion models that required over a thousand steps to achieve a similar result.

To achieve that, the researchers take an interesting three-step approach: first, the model is trained to handle different budgets of refinement iterations. Then, a guiding "teacher" model helps it make larger, more accurate updates at each iteration without "overshooting" the intended text. And finally, they tweak how each iteration works so the model can reach the final result in fewer, steadier steps.

When compared with larger diffusion models, FS-DFM performed well on two important metrics: perplexity and entropy.

In a nutshell, perplexity is a standard metric for text quality in language models: the lower the perplexity, the more accurate and natural the text sounds. Entropy, meanwhile, essentially measures how confidently the model selects each word. In practice, if entropy is too low, the text can become repetitive or predictable, but if it's too high, it can start to sound random or incoherent.
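To make those two metrics a bit more concrete, here is a minimal sketch of how they are typically computed from a model's token probabilities. The numbers below are made up for illustration and are not taken from the paper: perplexity is the exponential of the average negative log-likelihood of the generated tokens, and entropy is computed over the model's predicted distribution for a single next token.

import math

# Toy numbers, purely illustrative (not from the paper): the probability
# the model assigned to each token it actually generated.
token_probs = [0.42, 0.61, 0.08, 0.55, 0.33]

# Perplexity: exp of the average negative log-likelihood.
# Lower means the model finds the text more predictable and natural.
avg_nll = sum(-math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)

# Entropy of one predictive distribution over a tiny toy vocabulary.
# Too low -> repetitive text; too high -> near-random text.
next_token_dist = [0.70, 0.15, 0.10, 0.05]
entropy = -sum(p * math.log(p) for p in next_token_dist)

print(f"perplexity = {perplexity:.2f}, entropy = {entropy:.2f} nats")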
Compared with the Dream diffusion model with 7 billion parameters and the LLaDA diffusion model with 8 billion parameters, FS-DFM variants with 1.7, 1.3, and 0.17 billion parameters consistently achieved lower perplexity and maintained more stable entropy across all iteration counts.

Given the results and the promise this method shows, and given the lack of similar models and studies available, the researchers also said they "plan to release code and model checkpoints to facilitate reproducibility and further research."

If you'd like to dive deeper into Apple's methods and the models' implementation details, be sure to check the full paper on arXiv. It features multiple performance examples, such as this one, which color-codes the iteration at which each word was last changed:

Figure 9: Token-level generation timeline. The displayed text is the final sample; the background of each token encodes the step of its last change using eight light colors (start → end). Early-stabilized tokens appear in early hues, while late edits trend toward end hues, making localized refinements and overall convergence easy to see. Note that many tokens are colored yellow, indicating they were predicted early in the process. This is due to the cumulative scalar (contrast with Figure 4).

Find "FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models" on arXiv.