Clone a speaker's voice from a short audio sample and generate new speech in that voice without retraining.
Build an audiobook or podcast platform that reads text aloud in multiple languages with natural emotion and pacing.
Create a voice assistant or chatbot that speaks in different languages and dialects with controllable tone and speed.
Generate speech with precise pronunciation control using Pinyin for Chinese or phoneme notation for English.
Requires installing PyTorch and downloading large model weights from ModelScope or Hugging Face; the download can take 10-15 minutes depending on internet speed.
CosyVoice is a Python text-to-speech project that turns written text into spoken audio. It is built on top of a large language model and designed to produce voices that sound natural across many languages, match a reference speaker's voice closely, and stay faithful to the input text. The repository covers the full pipeline: inference with the trained models, training code for building your own models, and deployment.
The README says the latest version, Fun-CosyVoice 3.0, supports nine widely spoken languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian) plus more than eighteen Chinese dialects and accents, such as Cantonese, Sichuan, and Shanghainese. It can do zero-shot voice cloning: given a short sample of a target speaker, it can synthesise new sentences in that voice, including in a language other than the one spoken in the sample.
Other features the README calls out are pronunciation inpainting using Chinese Pinyin or English CMU phonemes; built-in text normalisation, so numbers and special symbols are read correctly; instruction support for language, dialect, emotion, speed, and volume; and a bi-streaming mode, where text streams in and audio streams out with latency as low as 150 milliseconds.
People reach for CosyVoice when they want high-quality multilingual speech synthesis they can run themselves, for example to build voice chatbots, audiobook narrators, dubbing tools, or accessibility features that need controllable voices. The README walks through cloning the repo, creating a Conda environment with Python 3.10, downloading pretrained models from ModelScope or Hugging Face, and optionally running inference through vLLM for faster serving.
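
The bi-streaming idea (text arriving in pieces while audio leaves in pieces) can be illustrated with plain Python generators. This is a minimal sketch of the data flow only, not CosyVoice's API: `fake_synthesize`, `stream_tts`, the `min_chars` threshold, and the 16 kHz sample rate are all hypothetical names and values chosen for illustration, and the "model" here just emits silence. Check the repo's README for the real streaming interface.

```python
from typing import Iterable, Iterator

SAMPLE_RATE = 16_000  # assumed value for this toy example only


def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a TTS model.

    Returns 16-bit mono silence, 50 ms of audio per input character.
    """
    samples = len(text) * SAMPLE_RATE // 20
    return b"\x00\x00" * samples


def stream_tts(text_chunks: Iterable[str], min_chars: int = 12) -> Iterator[bytes]:
    """Yield audio as soon as enough text has arrived.

    Text is buffered until it ends a sentence or reaches `min_chars`,
    so the caller can start playback before the full input is known.
    """
    buffer = ""
    for chunk in text_chunks:
        buffer += chunk
        ends_sentence = buffer.rstrip().endswith((".", "!", "?", "。"))
        if ends_sentence or len(buffer) >= min_chars:
            yield fake_synthesize(buffer)
            buffer = ""
    if buffer.strip():
        # Flush any leftover text once the input stream is exhausted.
        yield fake_synthesize(buffer)


if __name__ == "__main__":
    incoming = ["Hello ", "world. ", "How are", " you?"]
    for i, audio in enumerate(stream_tts(incoming)):
        print(f"audio chunk {i}: {len(audio)} bytes")
```

Because `stream_tts` is a generator, the first audio chunk is produced after the first complete sentence arrives rather than after the whole text. That is the property a low-latency streaming mode depends on; a real system would replace the silence generator with model inference.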
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.