funaudiollm/cosyvoice

Analysis updated 2026-05-18

★ 20,898 | Python | Audience · developer | Complexity · 3/5 | License | Setup · moderate

Mindmap

mindmap
  root((CosyVoice))
    What it does
      Text to speech
      Voice cloning
      Streaming audio
    Languages supported
      9 main languages
      18+ Chinese dialects
    Key features
      Emotion control
      Speed adjustment
      Pinyin support
    Tech stack
      Python
      LLM-based
    Use cases
      Audiobook creation
      Accessibility tools
      Voice assistants


What do people build with it?

USE CASE 1

Clone a speaker's voice from a short audio sample and generate new speech in that voice without retraining.

USE CASE 2

Build an audiobook or podcast platform that reads text aloud in multiple languages with natural emotion and pacing.

USE CASE 3

Create a voice assistant or chatbot that speaks in different languages and dialects with controllable tone and speed.

USE CASE 4

Generate speech with precise pronunciation control using Pinyin for Chinese or phoneme notation for English.

What is it built with?

Python · PyTorch · Hugging Face · Conda

How does it compare?

	funaudiollm/cosyvoice	1panel-dev/maxkb	marimo-team/marimo
Stars	20,898	20,884	20,818
Language	Python	Python	Python
Setup difficulty	moderate	moderate	moderate
Complexity	3/5	3/5	3/5
Audience	developer	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

Requires PyTorch installation and downloading large model weights from Hugging Face, which can take 10-15 minutes depending on internet speed.
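As a rough check on that download estimate, the arithmetic can be sketched in a few lines. The 5 GB total size and the bandwidth figures below are illustrative assumptions, not numbers from the README; substitute the actual size of the weights you download.

```python
def estimate_download_minutes(size_gb: float, bandwidth_mbps: float) -> float:
    """Estimate download time in minutes.

    size_gb is in decimal gigabytes (1 GB = 1000 MB) and bandwidth_mbps
    is in megabits per second, as ISPs usually quote it.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    size_megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    seconds = size_megabits / bandwidth_mbps
    return seconds / 60


if __name__ == "__main__":
    # Hypothetical 5 GB of model weights at a few connection speeds.
    for mbps in (25, 50, 100):
        minutes = estimate_download_minutes(5, mbps)
        print(f"{mbps} Mbit/s: about {minutes:.1f} minutes")
```

Under these assumptions, a 50 Mbit/s connection lands at roughly 13 minutes, which is consistent with the 10-15 minute range quoted above.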

Use freely for any purpose, including commercial use. Keep the license notice and state any changes you make; the license also includes a patent grant.

In plain English

CosyVoice is a Python project that turns written text into spoken audio. It is a text-to-speech system built on a large language model, designed to produce natural-sounding voices across many languages, closely match a reference speaker's voice, and stay faithful to the input text. The repository covers the full pipeline: inference with the trained models, training code so others can train their own, and deployment.

The README says the latest version, Fun-CosyVoice 3.0, supports nine widely spoken languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian) plus more than eighteen Chinese dialects and accents such as Cantonese, Sichuan, and Shanghainese. It can do zero-shot voice cloning: given a short sample of a target speaker, it synthesises new sentences in that voice, including across languages. Other features called out include pronunciation inpainting using Chinese Pinyin or English CMU phonemes, built-in text normalisation so numbers and special symbols are read correctly, instruction support for language, dialect, emotion, speed, and volume, and a bi-streaming mode where text streams in and audio streams out with latency as low as 150 milliseconds.

People reach for CosyVoice when they want high-quality multilingual speech synthesis they can run themselves, for example to build voice chatbots, audiobook narrators, dubbing tools, or accessibility features that need controllable voices. The README walks through cloning the repo, creating a Conda environment with Python 3.10, downloading pretrained models from ModelScope or Hugging Face, and optionally running inference through vLLM for faster serving.
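To make the bi-streaming idea concrete, here is a minimal sketch of the general pattern: text arrives in chunks, each chunk is synthesised as it arrives, and the caller can start playback after the first audio chunk instead of waiting for the whole utterance. This is not CosyVoice's API. `text_chunks`, `fake_synthesize`, and `stream_speech` are hypothetical names, and the synthesiser is a stand-in that returns silence.

```python
import time
from typing import Iterable, Iterator, Tuple


def text_chunks(text: str, words_per_chunk: int = 3) -> Iterator[str]:
    """Yield the text a few words at a time, imitating text that streams in."""
    words = text.split()
    for i in range(0, len(words), words_per_chunk):
        yield " ".join(words[i:i + words_per_chunk])


def fake_synthesize(chunk: str, sample_rate: int = 16000) -> bytes:
    """Hypothetical stand-in for a TTS model.

    Returns silent 16-bit mono PCM, 0.1 s of audio per word.
    """
    n_samples = (sample_rate // 10) * len(chunk.split())
    return b"\x00\x00" * n_samples  # 2 bytes per 16-bit sample


def stream_speech(chunks: Iterable[str]) -> Iterator[bytes]:
    """Lazily yield one audio chunk per text chunk (text in, audio out)."""
    for chunk in chunks:
        yield fake_synthesize(chunk)


def first_chunk_latency_ms(text: str) -> Tuple[float, bytes]:
    """Measure the time until the first audio chunk is available."""
    start = time.perf_counter()
    stream = stream_speech(text_chunks(text))
    first_audio = next(stream)  # only the first chunk is synthesised here
    return (time.perf_counter() - start) * 1000, first_audio


if __name__ == "__main__":
    latency, audio = first_chunk_latency_ms("text streams in and audio streams out")
    print(f"first chunk: {len(audio)} bytes after {latency:.3f} ms")
```

Because both stages are generators, the time to first audio depends on the first chunk only, not on the length of the full text. With a real model in place of the stand-in, that first-chunk time is the quantity a latency figure such as 150 milliseconds would describe.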

Copy-paste prompts

Prompt 1

How do I set up CosyVoice locally and generate speech from text in English and Chinese?

Prompt 2

Show me how to clone a speaker's voice using a short audio sample with CosyVoice's zero-shot voice cloning.

Prompt 3

How can I control emotion, speed, and volume when generating speech with CosyVoice?

Prompt 4

What's the best way to integrate CosyVoice's streaming output into a real-time voice application?

Prompt 5

How do I use Pinyin notation to fine-tune pronunciation for Chinese text in CosyVoice?

Frequently asked questions

What is cosyvoice?

Open-source text-to-speech system that converts written text to natural speech in 9 languages, with zero-shot voice cloning and streaming output.

What language is cosyvoice written in?

Mainly Python. The stack also includes PyTorch, Hugging Face, and Conda.

What license does cosyvoice use?

Use freely for any purpose, including commercial use. Keep the license notice and state any changes you make; the license also includes a patent grant.

How hard is cosyvoice to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is cosyvoice for?

Mainly developers.

Verify against the repo before relying on details.