Real-Time Voice Cloning
This repository is an implementation of Transfer Learning from Speaker Verification to
Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real-time. This was my master's thesis.
SV2TTS is a deep learning framework in three stages. In the first stage, one creates a digital representation of a voice from a few seconds of audio. In the second and third stages, this representation is used as reference to generate speech given arbitrary text.
Video demonstration (click the picture):

Papers implemented
Heads up
Like everything else in Deep Learning, this repo has quickly gotten old. Many SaaS apps (often paying) will give you a better audio quality than this repository will. If you wish for an open-source solution with a high voice quality:
- Check out paperswithcode for other repositories and recent research in the field of speech synthesis.
- Check out Chatterbox for a similar project up to date with the 2025 SOTA in voice cloning
Running the toolbox
Both Windows and Linux are supported.