
TADA: Fast, Reliable Speech Generation Through Text-Acoustic Synchronization


The future of voice AI hinges on sounding natural, fast, expressive, and free of quirks like hallucinated words or skipped content. Today's LLM-based TTS systems are forced to choose between speed, quality, and reliability because of a fundamental mismatch between how text and audio are represented inside language models.

TADA (Text-Acoustic Dual Alignment) resolves that mismatch with a novel tokenization scheme that synchronizes text and speech one-to-one. The result: the fastest LLM-based TTS system available, with competitive voice quality, virtually zero content hallucinations, and a footprint light enough for on-device deployment.

Hume AI is open-sourcing TADA to accelerate progress toward efficient, reliable voice generation. Code and pre-trained models are available now.

live demo

[Audio samples: Adoration Speech, Fearful Speech, Anger Speech, Long Speech]

Approach

For every second of spoken audio, the acoustic signal carries far more information than the corresponding text. A second of audio might correspond to 2–3 text tokens but 12.5–25 acoustic frames. This mismatch means LLM-based TTS systems must manage sequences in which audio tokens vastly outnumber text tokens — leading to longer context windows, higher memory consumption, slower inference, and more opportunities for the model to lose track of what it's supposed to say.
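A back-of-the-envelope sketch makes the mismatch concrete. The rates below are illustrative assumptions taken from the ranges quoted above (roughly 2–3 text tokens per second versus 12.5–25 acoustic frames per second), not parameters of TADA itself:

```python
def sequence_lengths(duration_s: float,
                     text_tokens_per_s: float = 2.5,
                     acoustic_frames_per_s: float = 25.0) -> tuple[int, int]:
    """Return (text_tokens, acoustic_frames) for an utterance of the
    given duration, using the assumed per-second rates above."""
    return (round(duration_s * text_tokens_per_s),
            round(duration_s * acoustic_frames_per_s))

# A one-minute utterance: ~150 text tokens vs. ~1500 acoustic frames,
# so the audio side of the sequence is ~10x longer than the text side.
text_len, audio_len = sequence_lengths(60.0)
print(text_len, audio_len, audio_len / text_len)  # 150 1500 10.0
```

Since attention cost grows with sequence length, that ~10x gap between the two modalities is what drives up context size, memory, and latency in a conventional LLM-based TTS stack.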
