explaingit

openbmb/voxcpm

18,697 | Python | Audience · developer | Complexity · 4/5 | License | Setup · hard

TLDR

VoxCPM is an open-source text-to-speech system that generates natural-sounding speech in 30 languages and can clone voices from short audio clips or create new voices from text descriptions.

Mindmap

mindmap
  root((VoxCPM))
    What it does
      Text to speech
      Voice cloning
      Voice design
    Tech
      Python
      Diffusion models
      Hugging Face weights
    Capabilities
      30 languages
      48kHz audio output
      Real-time streaming
    Audience
      AI developers
      App builders
      Content creators

Things people build with this

USE CASE 1

Convert a written article into spoken audio in 30 languages without any voice recording setup.

USE CASE 2

Clone a speaker's voice from a short audio sample to narrate new content in their style.

USE CASE 3

Build a multilingual voice assistant or podcast tool using a single open-source model.

USE CASE 4

Design a custom synthetic voice by describing its characteristics in text, no reference audio needed.

Tech stack

Python · PyTorch · Hugging Face

Getting it running

Difficulty · hard
Time to first run · 1h+

Requires a GPU for practical use: the 2-billion parameter model is large and slow on CPU.
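To put the GPU requirement in perspective, here is a back-of-envelope estimate of how much memory the weights alone would occupy. It is a rough sketch: it assumes all 2 billion parameters are stored at a single precision, and it ignores activations, caches, and framework overhead, so real usage will be higher. The precisions listed are common choices, not ones the project is confirmed to ship.

```python
# Back-of-envelope weight-memory estimate for a 2-billion parameter model.
# Assumption: every parameter is stored at the same precision. Activations,
# caches, and framework overhead are NOT included.

PARAMS = 2_000_000_000  # "2-billion parameter model", from the description

# Bytes per parameter for two common storage precisions (illustrative).
BYTES_PER_PARAM = {
    "float32": 4,
    "float16/bfloat16": 2,
}


def weight_gib(num_params: int, bytes_per_param: int) -> float:
    """Return the size of the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    for dtype, nbytes in BYTES_PER_PARAM.items():
        print(f"{dtype}: {weight_gib(PARAMS, nbytes):.2f} GiB")
```

Under these assumptions the weights come to roughly 7.5 GiB at 32-bit precision and roughly 3.7 GiB at 16-bit precision, which is why a GPU with several gigabytes of free memory is the practical starting point.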

Apache 2.0: use freely for any purpose, including commercial use, as long as you keep the license and copyright notice.

In plain English

VoxCPM is a text-to-speech system, software that converts written text into spoken audio. Its main technical distinction is that it skips the usual step of breaking speech into discrete sound tokens, instead generating speech directly as continuous audio representations through an architecture that combines diffusion models with autoregressive generation. The project claims this approach produces more natural and expressive speech than tokenization-based systems.

The current version, VoxCPM2, is a 2-billion parameter model trained on over 2 million hours of multilingual audio data across 30 languages. Beyond standard text-to-speech, it supports three additional capabilities: Voice Design (describing a voice in plain text and having the model create it without any reference recording), Controllable Voice Cloning (copying someone's voice from a short audio clip while optionally adjusting the style), and Ultimate Cloning (reproducing every detail of a voice by providing both the reference audio and its transcript).

Output is 48kHz audio. Installation is via pip, and the model weights are available on Hugging Face. A Python API, command-line interface, and web demo are all provided. The model can run in real-time streaming mode and is released under the Apache 2.0 license, permitting commercial use.
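To make the 48kHz output and streaming mode concrete, the sketch below shows one way a caller might consume audio chunk by chunk and write it to a 48 kHz WAV file as it arrives, using only the Python standard library. It does not call VoxCPM: `fake_chunks` is a hypothetical stand-in that yields a test tone, and the chunk size, mono layout, float sample range, and 16-bit encoding are all assumptions made for illustration. Check the repo for the real streaming interface.

```python
import math
import struct
import wave
from typing import Iterable, Iterator, List

SAMPLE_RATE = 48_000  # 48 kHz output, per the description above


def fake_chunks(seconds: float = 1.0, chunk_ms: int = 100) -> Iterator[List[float]]:
    """Hypothetical stand-in for a streaming TTS model.

    Yields a 440 Hz test tone as successive chunks of float samples
    in [-1.0, 1.0]. A real model would yield synthesized speech.
    """
    total = int(SAMPLE_RATE * seconds)
    step = SAMPLE_RATE * chunk_ms // 1000
    for start in range(0, total, step):
        end = min(start + step, total)
        yield [
            0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
            for n in range(start, end)
        ]


def write_stream(path: str, chunks: Iterable[List[float]]) -> int:
    """Write chunks to a 16-bit mono WAV as they arrive; return frames written."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)           # mono (assumption)
        wav.setsampwidth(2)           # 16-bit PCM (assumption)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            # Clamp each float to [-1, 1] and convert to signed 16-bit PCM.
            pcm = struct.pack(
                f"<{len(chunk)}h",
                *(int(max(-1.0, min(1.0, s)) * 32767) for s in chunk),
            )
            wav.writeframes(pcm)
            frames += len(chunk)
    return frames


if __name__ == "__main__":
    written = write_stream("stream_demo.wav", fake_chunks())
    print(written, "frames written")
```

The point of the pattern is that each chunk is handled as soon as it is produced, so a player or network socket could replace the file writer without waiting for the full utterance.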

Copy-paste prompts

Prompt 1
Using VoxCPM2, write Python code to convert a paragraph of English text into a 48kHz audio file using the default voice. Include the pip install command and the full script.
Prompt 2
I have a 10-second audio clip of a speaker. Show me how to use VoxCPM's Controllable Voice Cloning feature to clone that voice and use it to narrate new text in Python.
Prompt 3
I want to use VoxCPM's Voice Design feature to create a 'calm female narrator with a British accent' without providing any reference audio. Write the Python code to do this.
Prompt 4
Write a Python script that uses VoxCPM2 in real-time streaming mode to play audio as it generates from a long piece of text, rather than waiting for the full file.

Verify against the repo before relying on details.