TADA: Speech generation through text-acoustic synchronization

The future of voice AI hinges on speech that is natural, expressive, fast to generate, and free of quirks like hallucinated words or skipped content. Today's LLM-based TTS systems are forced to choose between speed, quality, and reliability because of a fundamental mismatch between how text and audio are represented inside language models.

TADA (Text-Acoustic Dual Alignment) resolves that mismatch with a novel tokenization schema that synchronizes text and speech one-to-one. The result: the fastest LLM-based TTS system available, with competitive voice quality, virtually zero content hallucinations, and a footprint light enough for on-device deployment.

Hume AI is open-sourcing TADA to accelerate progress toward efficient, reliable voice generation. Code and pre-trained models are available now.

Live demo (audio samples): Adoration Speech, Fearful Speech, Anger Speech, Long Speech

Approach

For every second of spoken audio, the acoustic signal carries far more information than the corresponding text. One second of speech typically corresponds to 2–3 text tokens but 12.5–25 acoustic frames. This mismatch means LLM-based TTS systems must manage sequences where audio tokens vastly outnumber text tokens, which leads to longer context windows, higher memory consumption, slower inference, and more opportunities for the model to lose track of what it's supposed to say.
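To make the size of the mismatch concrete, the short sketch below works through the arithmetic for one minute of speech using the rates quoted above. The function name and the 60-second duration are illustrative assumptions, not part of TADA's code; the only figures taken from the text are the 2–3 text tokens and 12.5–25 acoustic frames per second.

```python
def sequence_lengths(duration_s, text_tokens_per_s, frames_per_s):
    """Return (text tokens, acoustic frames, frames per text token).

    Hypothetical helper for illustration only; it multiplies a duration
    by the per-second rates quoted in the article.
    """
    text_tokens = duration_s * text_tokens_per_s
    acoustic_frames = duration_s * frames_per_s
    return text_tokens, acoustic_frames, acoustic_frames / text_tokens


# Assumed example: one minute of speech at the two extremes of the quoted ranges.
smallest_gap = sequence_lengths(60, text_tokens_per_s=3, frames_per_s=12.5)
largest_gap = sequence_lengths(60, text_tokens_per_s=2, frames_per_s=25)

for label, (tokens, frames, ratio) in [("smallest gap", smallest_gap),
                                       ("largest gap", largest_gap)]:
    print(f"{label}: {tokens} text tokens, {frames} acoustic frames, "
          f"about {ratio:.1f} frames per text token")
```

Under these assumptions, a minute of speech spans 120–180 text tokens but 750–1,500 acoustic frames, so the audio side of the sequence is roughly 4 to 12.5 times longer than the text side.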
