February 12, 2025

Install AI Voice Cloning and Text-To-Speech Model – Zonos – Install and Run Locally

In this tutorial, we explain how to install and run an AI model for text-to-speech conversion and voice cloning on a local computer. The name of this model is Zonos, version 0.1. It is released under the Apache 2.0 license, which is a very permissive license. The model generates audio files from a text prompt and a voice sample. For example, you can record a sample of your voice, enter a text prompt, and the model will generate an audio file with a voice that resembles yours. The model can control speaking rate, pitch variation, and audio quality, as well as emotions such as happiness, anger, and sadness. The model also supports popular languages, such as English, Japanese, French, Chinese, and German, as well as other languages.

The YouTube tutorial is given below.

Install Zonos: AI Voice Cloning and Text-To-Speech Model

We install Zonos on Linux Ubuntu 24.04. First, you have to install the following:

  1. NVIDIA CUDA Toolkit. To install NVIDIA CUDA Toolkit, follow the tutorial given here.
  2. Anaconda. To install Anaconda on Linux Ubuntu, follow the tutorial given here.

The next step is to update the system and install espeak-ng:

sudo apt update && sudo apt upgrade

sudo apt install -y espeak-ng
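
Before continuing, it can help to confirm that the prerequisites are reachable from the shell. The short Python sketch below is not part of the original steps; it simply looks up each command on the PATH by using shutil.which. The command names (nvcc for the CUDA Toolkit, conda for Anaconda) are assumptions about a typical installation, and a tool can be installed without being on the PATH, so a "NOT FOUND" result is a hint rather than proof that something is missing.

```python
import shutil

# Command names are assumptions about how each prerequisite shows up on PATH.
REQUIRED_TOOLS = ("nvcc", "conda", "espeak-ng", "git")


def find_tools(names=REQUIRED_TOOLS):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:10s} {path or 'NOT FOUND'}")
```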

Next, go to the home folder and clone the Zonos repository:

cd ~
git clone https://github.com/Zyphra/Zonos

Next, create the Conda virtual environment and activate the virtual environment:

cd Zonos
conda create -n ZonosEnv python=3.10
conda activate ZonosEnv
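
To confirm that the new environment is the active one, the sketch below reads the CONDA_DEFAULT_ENV variable, which conda sets when an environment is activated, and compares the running Python version with the one requested above. The function names are only illustrative, and the check assumes the script is started with the python executable of the activated environment.

```python
import os
import sys

EXPECTED_ENV = "ZonosEnv"    # environment name used in the conda create command
EXPECTED_PYTHON = (3, 10)    # Python version requested in that command


def environment_report():
    """Collect the active conda environment name and the Python major.minor version."""
    return {
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "python": tuple(sys.version_info[:2]),
    }


def environment_matches(report):
    """Return True if the report matches the environment created in this tutorial."""
    return report["conda_env"] == EXPECTED_ENV and report["python"] == EXPECTED_PYTHON


if __name__ == "__main__":
    report = environment_report()
    print(report)
    print("OK" if environment_matches(report) else "Environment differs from the tutorial setup")
```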

Next, install all the libraries and compile everything:

pip install -e .
pip install --no-build-isolation -e .[compile]
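
After the installation, a quick check can show whether the packages used later in this tutorial are visible in the environment and whether PyTorch can see the GPU. This is a minimal sketch, not an official Zonos check: it only tests that the packages can be located and calls torch.cuda.is_available() when torch is present. The helper names check_packages and cuda_visible are made up for this example.

```python
import importlib.util


def check_packages(names=("torch", "torchaudio", "zonos")):
    """Report which top-level packages can be located in the active environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def cuda_visible():
    """Return torch.cuda.is_available(), or None if torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the script also runs without torch

    return torch.cuda.is_available()


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name:12s} {'found' if found else 'missing'}")
    print("CUDA visible to PyTorch:", cuda_visible())
```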

The next step is to create an audio speech sample that will be cloned by the model. To do that, we will use arecord, a simple command-line tool for recording sound and voice on Linux Ubuntu. To record your voice, run this command:

arecord -f cd source1.mp3

The command keeps running while you are recording. Once you are done recording your voice, press CTRL+C. The generated file with your voice is source1.mp3. Note that arecord writes WAV audio by default, so this file contains WAV data despite its .mp3 extension. You will use this file in the Python code. Open your favorite Python editor and type this code:

import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the transformer model onto the GPU.
# To use the hybrid model instead, pass "Zyphra/Zonos-v0.1-hybrid".
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Load the recorded voice sample and compute the speaker embedding from it
wav, sampling_rate = torchaudio.load("source1.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Fix the random seed for reproducibility
torch.manual_seed(421)

# Build the conditioning from the text prompt, the speaker embedding, and the language
cond_dict = make_cond_dict(text="This is a test code for generating the voice clone by using Zonos. I should sound like you if I am properly being installed and trained!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)

# Generate the audio codes
codes = model.generate(conditioning)

# Decode the codes into a waveform and save it as sample.wav
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)

Save this file as test.py and run the code (see the YouTube tutorial):

python3 test.py

The code will first download the model files from the Hugging Face repository, and then it will generate speech for this text:

This is a test code for generating the voice clone by using Zonos. I should sound like you if I am properly being installed and trained!

The sound will be saved in the file sample.wav, which you can play by using any media player. For more details, see the YouTube tutorial given above. Another approach for running the Zonos model is to use the Gradio user interface. To start it, run this command:

python3 gradio_interface.py

For more details on how to use the Gradio interface for Zonos, see the YouTube tutorial given above.
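
As an optional sanity check on the voice sample recorded with arecord earlier, the sketch below uses Python's standard wave module to report the sample rate, channel count, and duration of the recording. It is not part of the original steps. It assumes the file holds 16-bit PCM WAV data, which is what arecord -f cd writes by default regardless of the file extension. The function name describe_recording is just illustrative.

```python
import wave


def describe_recording(path):
    """Return basic properties of a PCM WAV file (the extension is not checked)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        frames = wf.getnframes()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": frames / rate,
        }
```

For example, describe_recording("source1.mp3") called from the Zonos folder should report a sample rate of 44100 Hz, 2 channels, and 2 bytes per sample for a recording made with the -f cd option. A very short duration suggests that the recording was stopped too early.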