Real-Time Voice Cloning
This repository is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real-time. This was my master's thesis.
SV2TTS is a deep learning framework with three stages. In the first stage, one creates a digital representation of a voice from a few seconds of audio. In the second and third stages, this representation is used as a reference to generate speech from arbitrary text.
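The data flow between the three stages can be sketched as follows. This is a minimal illustration, not the repository's API: every function name, the constants `EMBED_DIM`, `N_MELS` and `HOP`, and the toy arithmetic inside each stage are assumptions made for the example. The real stages are trained neural networks (encoder, synthesizer, vocoder).

```python
from typing import List

# All sizes below are assumed values chosen for illustration only.
EMBED_DIM = 256  # length of the voice embedding
N_MELS = 80      # mel channels per spectrogram frame
HOP = 200        # waveform samples produced per spectrogram frame


def encode_speaker(reference_audio: List[float]) -> List[float]:
    """Stage 1 (placeholder): reference audio -> fixed-size voice embedding."""
    # Fold the samples into a fixed-length vector, then L2-normalize it.
    embed = [0.0] * EMBED_DIM
    for i, sample in enumerate(reference_audio):
        embed[i % EMBED_DIM] += sample
    norm = sum(x * x for x in embed) ** 0.5 or 1.0
    return [x / norm for x in embed]


def synthesize_spectrogram(text: str, embed: List[float]) -> List[List[float]]:
    """Stage 2 (placeholder): text + embedding -> mel spectrogram."""
    # One dummy frame per character, conditioned on the embedding.
    return [
        [embed[(ord(ch) + m) % EMBED_DIM] for m in range(N_MELS)]
        for ch in text
    ]


def vocode(mel: List[List[float]]) -> List[float]:
    """Stage 3 (placeholder): mel spectrogram -> waveform samples."""
    # Emit HOP identical samples per frame; a real vocoder predicts audio.
    return [sum(frame) / N_MELS for frame in mel for _ in range(HOP)]


def clone_voice(reference_audio: List[float], text: str) -> List[float]:
    """Chain the three stages: audio -> embedding -> spectrogram -> waveform."""
    embed = encode_speaker(reference_audio)
    mel = synthesize_spectrogram(text, embed)
    return vocode(mel)
```

The point of the sketch is the shape of the pipeline: the embedding is computed once from the reference audio and can then be reused to synthesize any number of texts in the same voice.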
Papers implemented
| arXiv ID | Designation | Title | Implementation source |
| --- | --- | --- | --- |
| 1806.04558 | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo |
| 1802.08435 | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | fatchord/WaveRNN |
| 1703.10135 | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | fatchord/WaveRNN |
| 1710.10467 | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | This repo |
Heads up
Like everything else in deep learning, this repo has quickly become outdated. Many SaaS apps (often paid) will give you better audio quality than this repository. If you want an open-source solution with high voice quality:
Check out paperswithcode for other repositories and recent research in the field of speech synthesis.
Check out Chatterbox for a similar project that is up to date with the 2025 state of the art in voice cloning.