A group of Apple and Tel-Aviv University researchers figured out a way to speed up AI-based text-to-speech generation without sacrificing intelligibility. Here’s how they did it.
An interesting new approach to generating speech faster
In a new paper titled Principled Coarse-Grained Acceptance for Speculative Decoding in Speech, the researchers detail how speech generation from text can be sped up.
While there are currently multiple approaches to generating speech from text, the researchers focused on autoregressive text-to-speech models, which generate speech tokens one at a time.
If you’ve ever looked up how most large language models work, you’re probably familiar with autoregressive models, which predict the next token based on all the tokens that came before.
Autoregressive speech generation works in a generally similar way, except that the tokens represent audio chunks rather than words or characters.
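To make that concrete, here is a minimal illustrative sketch of an autoregressive generation loop. The model object and method names are hypothetical placeholders, not Apple's actual system; the point is simply that every new audio token depends on everything generated so far.

```python
# Illustrative sketch of autoregressive audio-token generation.
# `model`, `encode_text`, and `predict_next` are hypothetical stand-ins,
# not Apple's actual implementation.

def generate_speech_tokens(model, text_prompt, max_tokens=500, eos_token=0):
    """Generate discrete audio tokens one at a time, each conditioned on all previous ones."""
    context = model.encode_text(text_prompt)  # text conditioning tokens
    audio_tokens = []
    for _ in range(max_tokens):
        # The model re-reads the full history (text + audio so far) at every step,
        # which is why generation time grows with the length of the output.
        next_token = model.predict_next(context + audio_tokens)
        if next_token == eos_token:
            break
        audio_tokens.append(next_token)
    return audio_tokens  # a separate codec later turns these tokens into a waveform
```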
While this one-token-at-a-time approach produces high-quality speech, it also creates a processing bottleneck. A common fix is speculative decoding, in which a smaller, faster draft model proposes several tokens ahead and the main model verifies them, normally accepting only the tokens that exactly match its own predictions. As Apple's researchers explain, that exact-match rule is a poor fit for speech:
However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups.
In other words, the standard acceptance rule is too strict for speech: it often rejects drafted tokens that would have sounded just as good, simply because they don't exactly match the token the main model would have produced. That, in turn, limits how much of a speedup speculative decoding can deliver.
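Here is a rough sketch of that verification step under the standard exact-match rule. The draft_model and target_model objects are hypothetical stand-ins for illustration, not the paper's code.

```python
# Illustrative sketch of speculative-decoding verification with exact-match acceptance.
# `target_model` and its `predict_next` method are hypothetical, not the paper's API.

def verify_draft(target_model, context, draft_tokens):
    """Accept drafted tokens only while they exactly match the target model's own choice."""
    accepted = []
    for drafted in draft_tokens:
        # What would the big model have generated at this position?
        target_token = target_model.predict_next(context + accepted)
        if drafted == target_token:
            accepted.append(drafted)       # exact match: keep it at no quality cost
        else:
            accepted.append(target_token)  # mismatch: fall back to the target's token
            break                          # and throw away the rest of the draft
    return accepted
```

With audio tokens, many of those "mismatches" would have sounded essentially the same, which is exactly the waste the paper targets.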
Enter Principled Coarse-Graining (PCG)