CorentinJ/Real-Time-Voice-Cloning — Clone a voice in 5 seconds to generate arbitrary speech

Real-Time Voice Cloning

This repository is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real time. This was my master's thesis.

SV2TTS is a deep learning framework in three stages. In the first stage, one creates a digital representation of a voice from a few seconds of audio. In the second and third stages, this representation is used as a reference to generate speech from arbitrary text.
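To make the data flow between the three stages concrete, here is a minimal toy sketch. It is not this repository's API: every name (`encode_speaker`, `synthesize_spectrogram`, `vocode`, `clone_voice`) and every constant is hypothetical, and each stage is a trivial stand-in for what would be a trained neural network. It only shows what is passed from one stage to the next: audio to a fixed-size embedding, text plus embedding to a spectrogram, and spectrogram to a waveform.

```python
import math
from typing import List

# Hypothetical toy sizes, chosen only to keep the example small.
EMBED_DIM = 8    # length of the speaker embedding
N_MELS = 4       # channels per spectrogram frame
HOP_LENGTH = 10  # waveform samples produced per spectrogram frame


def encode_speaker(reference_wav: List[float]) -> List[float]:
    """Stage 1 stand-in: reduce reference audio to a fixed-size, unit-length vector."""
    chunk = max(1, len(reference_wav) // EMBED_DIM)
    # Mean absolute amplitude of each chunk; a real encoder would be a trained network.
    sums = [
        sum(abs(s) for s in reference_wav[i * chunk:(i + 1) * chunk])
        for i in range(EMBED_DIM)
    ]
    norm = math.sqrt(sum(v * v for v in sums)) or 1.0
    return [v / norm for v in sums]


def synthesize_spectrogram(text: str, embedding: List[float]) -> List[List[float]]:
    """Stage 2 stand-in: one frame per character, conditioned on the embedding."""
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0
        frames.append([base + embedding[m % EMBED_DIM] for m in range(N_MELS)])
    return frames


def vocode(spectrogram: List[List[float]]) -> List[float]:
    """Stage 3 stand-in: expand every frame into HOP_LENGTH waveform samples."""
    wav: List[float] = []
    for frame in spectrogram:
        level = sum(frame) / len(frame)
        wav.extend(
            level * math.sin(2 * math.pi * n / HOP_LENGTH) for n in range(HOP_LENGTH)
        )
    return wav


def clone_voice(reference_wav: List[float], text: str) -> List[float]:
    """Chain the three stages: audio -> embedding -> spectrogram -> waveform."""
    embedding = encode_speaker(reference_wav)
    spectrogram = synthesize_spectrogram(text, embedding)
    return vocode(spectrogram)


# One second of a 220 Hz tone at 16 kHz stands in for the reference recording.
reference = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
speech = clone_voice(reference, "hello")
print(len(speech))  # 5 characters * 10 samples per frame = 50
```

The point of the split is that the embedding is computed once per speaker and can then be reused with any text, which is why only a few seconds of reference audio are needed.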

Papers implemented

arXiv ID	Designation	Title	Implementation source
1806.04558	SV2TTS	Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis	This repo
1802.08435	WaveRNN (vocoder)	Efficient Neural Audio Synthesis	fatchord/WaveRNN
1703.10135	Tacotron (synthesizer)	Tacotron: Towards End-to-End Speech Synthesis	fatchord/WaveRNN
1710.10467	GE2E (encoder)	Generalized End-To-End Loss for Speaker Verification	This repo

Heads up

Like everything else in deep learning, this repo has quickly gotten old. Many SaaS apps (often paid) will give you better audio quality than this repository will. If you wish for an open-source solution with high voice quality:

Check out paperswithcode for other repositories and recent research in the field of speech synthesis.
Check out Chatterbox for a similar project that is up to date with the 2025 state of the art in voice cloning.

Running the toolbox

Both Windows and Linux are supported.
