explaingit

funaudiollm/cosyvoice

20,898 | Python | Audience · developer | Complexity · 3/5 | Active | License | Setup · moderate

TLDR

Open-source text-to-speech system that converts written text to natural speech in 9 languages, with zero-shot voice cloning and streaming output.

Mindmap

mindmap
  root((CosyVoice))
    What it does
      Text to speech
      Voice cloning
      Streaming audio
    Languages supported
      9 main languages
      18 Chinese dialects
    Key features
      Emotion control
      Speed adjustment
      Pinyin support
    Tech stack
      Python
      LLM-based
    Use cases
      Audiobook creation
      Accessibility tools
      Voice assistants

Things people build with this

USE CASE 1

Clone a speaker's voice from a short audio sample and generate new speech in that voice without retraining.

USE CASE 2

Build an audiobook or podcast platform that reads text aloud in multiple languages with natural emotion and pacing.

USE CASE 3

Create a voice assistant or chatbot that speaks in different languages and dialects with controllable tone and speed.

USE CASE 4

Generate speech with precise pronunciation control using Pinyin for Chinese or phoneme notation for English.

Tech stack

Python | PyTorch | Hugging Face | Conda

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires PyTorch installation and downloading large model weights from Hugging Face, which can take 10-15 minutes depending on internet speed.
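
The weight download can be sketched as follows. This is a minimal sketch, assuming the `huggingface_hub` package is installed. The repo ID in the usage comment is a placeholder because this page does not name one, and the `pretrained_models/` folder and both function names are illustrative choices rather than something the project requires.

```python
from pathlib import Path


def local_model_dir(repo_id: str, root: str = "pretrained_models") -> Path:
    """Map a Hub repo ID such as 'org/name' to a local folder 'root/name'."""
    return Path(root) / repo_id.split("/")[-1]


def download_weights(repo_id: str, root: str = "pretrained_models") -> Path:
    """Download every file of a Hugging Face model repo into a local folder."""
    # Imported lazily so the path helper above works without the package.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Usage (placeholder ID; substitute the model repo named in the CosyVoice README):
# download_weights("your-org/your-cosyvoice-model")
```

Keeping the download in a function makes it easy to call once during setup and skip on later runs when the target folder already exists.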

Use freely for any purpose, including commercial use. Keep the license notice and state any changes you make; the license also includes a patent grant.

In plain English

CosyVoice is a Python project for turning written text into spoken audio. It is a text-to-speech system built on a large language model, designed to produce voices that sound natural across many languages, closely match a reference speaker's voice, and stay faithful to the input text. The repository covers the full pipeline: inference with the trained models, training code so others can train their own, and deployment.

The README says the latest version, Fun-CosyVoice 3.0, supports nine widely spoken languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian), plus more than eighteen Chinese dialects and accents such as Cantonese, Sichuan, and Shanghainese. It can do zero-shot voice cloning: give it a short sample of a target speaker and it can synthesise new sentences in that voice, including across languages.

Other features called out include pronunciation inpainting using Chinese Pinyin or English CMU phonemes; built-in text normalisation so numbers and special symbols are read correctly; instruction support for language, dialect, emotion, speed, and volume; and a bi-streaming mode where text streams in and audio streams out with latency as low as 150 milliseconds.

People reach for CosyVoice when they want high-quality multilingual speech synthesis they can run themselves, for example to build voice chatbots, audiobook narrators, dubbing tools, or accessibility features that need controllable voices. The README walks through cloning the repo, creating a Conda environment with Python 3.10, downloading pretrained models from ModelScope or Hugging Face, and optionally running inference through vLLM for faster serving.
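
For streaming output, a latency figure like the 150 milliseconds quoted above is usually checked by timing how long the first audio chunk takes to arrive. The sketch below shows one way to measure that for any chunk generator. `measure_stream` and `fake_stream` are hypothetical helpers, and the fake stream only simulates delays; nothing here calls CosyVoice itself.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume a chunk stream; return (first-chunk latency s, total s, chunk count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            # Time from requesting audio until the first chunk is available.
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return (first if first is not None else total), total, count


def fake_stream(n: int, delay: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesiser: n chunks, `delay` s apart."""
    for _ in range(n):
        time.sleep(delay)
        yield b"\x00" * 320  # placeholder bytes, not real audio


first_s, total_s, n = measure_stream(fake_stream(3, 0.05))
print(f"first chunk after {first_s * 1000:.0f} ms, {n} chunks in {total_s * 1000:.0f} ms")
```

In a real application the generator returned by a streaming inference call would replace `fake_stream`, and playback would start as soon as the first chunk arrives rather than after the whole utterance is synthesised.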

Copy-paste prompts

Prompt 1
How do I set up CosyVoice locally and generate speech from text in English and Chinese?
Prompt 2
Show me how to clone a speaker's voice using a short audio sample with CosyVoice's zero-shot voice cloning.
Prompt 3
How can I control emotion, speed, and volume when generating speech with CosyVoice?
Prompt 4
What's the best way to integrate CosyVoice's streaming output into a real-time voice application?
Prompt 5
How do I use Pinyin notation to fine-tune pronunciation for Chinese text in CosyVoice?

Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.