Open (Apache 2.0) TTS model for streaming conversational audio in real time

Dia2 is a streaming dialogue TTS model created by Nari Labs.

The model does not need the entire text before it produces audio: it can start generating as soon as the first few words are given as input. You can also condition the output on audio, enabling natural conversations in real time.
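To make the streaming idea concrete, here is a minimal, library-free sketch of incremental generation. It is not Dia2's API: `fake_synthesize`, `stream_tts`, the `lookahead` parameter, and the 16 kHz sample rate are all hypothetical names and values chosen for illustration, and the "synthesizer" just emits a sine tone. The point is only the control flow: audio chunks are yielded while the text is still arriving.

```python
import math
from typing import Iterable, Iterator, List

SAMPLE_RATE = 16_000  # assumed value for this toy example, not Dia2's actual rate


def fake_synthesize(word: str) -> List[float]:
    """Hypothetical stand-in for a TTS step: 50 ms of sine tone per character."""
    n_samples = (SAMPLE_RATE // 20) * len(word)
    freq = 220.0
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n_samples)]


def stream_tts(words: Iterable[str], lookahead: int = 2) -> Iterator[List[float]]:
    """Yield one audio chunk per word as soon as `lookahead` words are buffered.

    The text source is consumed lazily, so the first chunk is produced
    long before the full text has been seen.
    """
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        if len(buffer) >= lookahead:
            yield fake_synthesize(buffer.pop(0))
    # The text source is exhausted: flush the words still waiting in the buffer.
    for word in buffer:
        yield fake_synthesize(word)


if __name__ == "__main__":
    for chunk in stream_tts(iter("hello there how are you".split())):
        print(f"got {len(chunk)} samples")
```

Because `stream_tts` is a generator, a caller can hand each chunk to an audio sink as it arrives instead of waiting for the whole utterance.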

We provide model checkpoints (1B, 2B) and inference code to accelerate research. The model currently supports English only, with up to 2 minutes of audio per generation.

⚠️ Quality and voices vary per generation, as the model is not fine-tuned on a specific voice. Use a prefix or fine-tune the model to obtain stable output.

Try it now on Hugging Face Spaces

Upcoming

Bonsai (JAX) implementation

Dia2 TTS Server: Real streaming support

Sori: Dia2-powered speech-to-speech engine written in Rust

Quickstart
