Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU


Sopro TTS

Sopro (from the Portuguese word for “breath/blow”) is a lightweight English text-to-speech model I trained as a side project. It is built from dilated convolutions (à la WaveNet) and lightweight cross-attention layers rather than the usual Transformer architecture. Sopro is not SOTA across most voices and situations, but I still think it’s a cool project for a very low budget (it was trained on a single L40S GPU), and it could be improved with better data.
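
To give a feel for why stacked dilated convolutions are attractive in a small model, here is a minimal sketch of how their receptive field grows. The kernel sizes and dilation schedules below are illustrative assumptions (the first mirrors the doubling pattern from the WaveNet paper), not Sopro's actual configuration, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of a stack of 1D dilated convolutions.

    Each layer with dilation d and kernel size k widens the receptive
    field by (k - 1) * d steps; a single input step counts as 1.
    """
    if kernel_size < 1:
        raise ValueError("kernel_size must be at least 1")
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed schedule: dilation doubles each layer (1, 2, 4, ..., 512).
doubling = [2 ** i for i in range(10)]

# Ten layers with kernel size 2 already cover 1024 time steps.
wavenet_style_rf = receptive_field(2, doubling)

# The same ten layers without dilation cover only 11 time steps.
plain_rf = receptive_field(2, [1] * 10)

print(wavenet_style_rf, plain_rf)
```

With doubling dilations the context grows exponentially with depth while the parameter count grows only linearly, which is one reason this layer type suits a compact model.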

Some of the main features are:

169M parameters

Streaming

Zero-shot voice cloning

0.25 RTF on CPU (measured on an M3 base model), meaning it generates 30 seconds of audio in 7.5 seconds

3-12 seconds of reference audio for voice cloning
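
The RTF figure in the list above is simple arithmetic: real-time factor is synthesis time divided by the duration of the generated audio, so values below 1.0 mean faster than real time. A small sketch, where the function names are my own for illustration and not part of Sopro's API:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def expected_synthesis_time(audio_seconds, rtf):
    """Estimated wall-clock time needed to generate a clip at a given RTF."""
    return audio_seconds * rtf


# Numbers quoted in the feature list: 30 s of audio generated in 7.5 s.
rtf = real_time_factor(7.5, 30.0)
print(rtf)  # 0.25

# At the same RTF, a 60 s clip would take about 15 s on comparable hardware.
print(expected_synthesis_time(60.0, rtf))  # 15.0
```

The 0.25 figure was measured on an M3 base model, so timings on other CPUs will differ.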

Instructions
