In this tutorial, we explain how to download, install, and run Kokoro-82M locally on a Linux Ubuntu computer. Kokoro is an open-weight and open-source text-to-speech (TTS) model. Its main advantage is that it is lightweight while still delivering quality comparable to that of larger models. Due to its relatively small number of parameters (82 million, as the name indicates), it is faster and more cost-efficient than larger models. You can integrate Kokoro-82M into robotics projects. Namely, Kokoro-82M, combined with large language models and other AI models, can give a robot the ability to express itself like a human being. For example, in a practical application, you would use this model to develop a personal AI assistant or to enable a robot to communicate with humans.
The YouTube tutorial is given below.
Installation Procedure for Kokoro on Linux Ubuntu
We are using Linux Ubuntu 24.04. First, open a terminal and type the following commands to update the system and install the espeak-ng package
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install espeak-ng
Then, verify that you have Python installed on your computer by typing
which python3
This command should return the path to the Python interpreter. Next, check your Python version by typing
python3 --version
In our case, we are using Python 3.12. Next, let us install the venv package, create a workspace folder, and create and activate a Python virtual environment
sudo apt install python3.12-venv
cd ~
mkdir kokoro
cd kokoro
python3 -m venv env1
source env1/bin/activate
Next, install the necessary libraries
pip install kokoro
pip install soundfile
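Before writing the script, it can help to confirm that the environment is set up as expected. The following is a minimal sketch and not part of the original procedure: the function name check_environment is hypothetical, and it uses only the Python standard library to check that a virtual environment is active, that the espeak-ng executable is on the PATH, and that the kokoro and soundfile packages can be found.

```python
import importlib.util
import shutil
import sys


def check_environment(packages=("kokoro", "soundfile"), tools=("espeak-ng",)):
    """Return a dict mapping each check name to True or False.

    check_environment is a hypothetical helper written for this tutorial;
    it is not part of Kokoro.
    """
    results = {}
    # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
    results["virtual environment active"] = sys.prefix != sys.base_prefix
    # shutil.which returns None when the executable is not on the PATH.
    for tool in tools:
        results[f"{tool} on PATH"] = shutil.which(tool) is not None
    # find_spec returns None when a top-level package is not installed.
    for package in packages:
        found = importlib.util.find_spec(package) is not None
        results[f"{package} installed"] = found
    return results


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

If any line is reported as MISSING, repeat the corresponding installation step above before continuing.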
The next step is to write the Python code. The code is given below.
from kokoro import KPipeline
import soundfile as sf
# 🇺🇸 'a' => American English, 🇬🇧 'b' => British English
# 🇯🇵 'j' => Japanese: pip install misaki[ja]
# 🇨🇳 'z' => Mandarin Chinese: pip install misaki[zh]
pipeline = KPipeline(lang_code='a') # <= make sure lang_code matches voice
# This text is for demonstration purposes only, unseen during training
text = '''
In this tutorial, we explain how to download, install, and run
Kokoro locally on a Linux Ubuntu computer. Kokoro is an open-weight text-to-speech model,
or briefly TTS model. Its main advantage is that it is lightweight;
however, at the same time, it delivers quality comparable to larger models.
Due to its relatively small number of parameters it is faster and
more cost-efficient than larger models.
In this tutorial, we will thoroughly explain all the steps you
need to perform in order to run the model.
In a practical application, you would use this model to develop
a personal AI assistant, or to enable a computer to communicate with humans.
'''
# Generate speech with the 'af_nicole' voice at normal speed.
# split_pattern=r'\n+' splits the text into segments at line breaks.
generator = pipeline(
    text, voice='af_nicole',
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)   # i => index
    print(gs)  # gs => graphemes/text
    print(ps)  # ps => phonemes
    sf.write(f'{i}.wav', audio, 24000)  # save each segment as a 24 kHz WAV file
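The loop in the script writes one WAV file per text segment (0.wav, 1.wav, and so on). If a single audio file is more convenient, the segments can be joined afterwards. The following is a minimal sketch using only Python's standard wave module. The function name merge_wav_files and the output name combined.wav are hypothetical, and the sketch assumes that the segment files are uncompressed PCM WAV files that share the same sample rate, sample width, and channel count.

```python
import wave
from pathlib import Path


def merge_wav_files(input_paths, output_path):
    """Concatenate PCM WAV files, in the given order, into one WAV file.

    input_paths must be a non-empty list of paths.
    Returns the total number of audio frames written.
    """
    if not input_paths:
        raise ValueError("no input files given")

    # Read the audio format of the first segment; all others must match it.
    with wave.open(str(input_paths[0]), "rb") as first:
        p = first.getparams()
    fmt = (p.nchannels, p.sampwidth, p.framerate)

    total_frames = 0
    with wave.open(str(output_path), "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for path in input_paths:
            with wave.open(str(path), "rb") as segment:
                q = segment.getparams()
                if (q.nchannels, q.sampwidth, q.framerate) != fmt:
                    raise ValueError(f"{path} has a different audio format")
                # Append the raw audio frames of this segment to the output.
                out.writeframes(segment.readframes(q.nframes))
                total_frames += q.nframes
    return total_frames


if __name__ == "__main__":
    # Collect the numbered files (0.wav, 1.wav, ...) from the current folder,
    # sorted numerically so that 10.wav comes after 9.wav.
    segments = sorted(
        (f for f in Path(".").glob("*.wav") if f.stem.isdigit()),
        key=lambda f: int(f.stem),
    )
    if segments:
        frames = merge_wav_files(segments, "combined.wav")
        print(f"Wrote combined.wav with {frames} frames")
```

Run this from the workspace folder after the segment files have been generated. Sorting by the numeric file name matters because a plain alphabetical sort would place 10.wav before 2.wav.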
The text-to-speech script converts the text to speech. Because split_pattern=r'\n+' splits the text at line breaks, each line of the text is converted separately and stored in its own WAV file (0.wav, 1.wav, and so on) in the workspace folder (for more details, see the YouTube tutorial). Here, we use the voice specified by “af_nicole”. For other voices, see this link
https://huggingface.co/hexgrad/Kokoro-82M/tree/main/voices
and see the YouTube tutorial for the complete explanation.
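The comment in the script asks you to make sure that lang_code matches the voice. Judging from the voice af_nicole being used with lang_code='a', the first letter of a voice name appears to be its language code. Treat this as an assumption and confirm it against the voice list linked above. Under that assumption, a small check could look like the following sketch, where lang_code_for_voice and LANGUAGE_NAMES are hypothetical helpers that are not part of Kokoro, and the table contains only the four codes listed in the script's comments.

```python
# Language codes taken from the comments in the text-to-speech script.
LANGUAGE_NAMES = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Mandarin Chinese",
}


def lang_code_for_voice(voice):
    """Return the language code implied by a voice name such as 'af_nicole'.

    Assumes the first character of the voice name is the language code.
    """
    code = voice[:1].lower()
    if code not in LANGUAGE_NAMES:
        raise ValueError(f"unrecognized language prefix in voice name: {voice!r}")
    return code


if __name__ == "__main__":
    voice = "af_nicole"
    code = lang_code_for_voice(voice)
    print(f"{voice}: lang_code='{code}' ({LANGUAGE_NAMES[code]})")
```

With such a helper, the pipeline could be created as KPipeline(lang_code=lang_code_for_voice(voice)), so that changing the voice name also updates the language code.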