Convert a written article into spoken audio in 30 languages without any voice recording setup.
Clone a speaker's voice from a short audio sample to narrate new content in their style.
Build a multilingual voice assistant or podcast tool using a single open-source model.
Design a custom synthetic voice by describing its characteristics in text, no reference audio needed.
Requires a GPU for practical use: the 2-billion-parameter model is large and slow on CPU.
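To put the size of the model in perspective, here is a rough estimate of the memory needed just to hold 2 billion weights. The bytes-per-parameter figures are assumptions about common numeric formats, not values taken from the VoxCPM repository, and the estimate ignores activations and any runtime overhead.

```python
# Back-of-the-envelope weight memory for a 2-billion-parameter model.
# The bytes-per-parameter values are assumptions about common numeric
# formats, not figures published by the VoxCPM project.
PARAMS = 2_000_000_000

BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2}


def weight_gib(params: int, bytes_per_param: int) -> float:
    """Return the size of the raw weights in GiB (2**30 bytes)."""
    return params * bytes_per_param / 2**30


if __name__ == "__main__":
    for fmt, size in BYTES_PER_PARAM.items():
        print(f"{fmt}: {weight_gib(PARAMS, size):.2f} GiB")
```

Under these assumptions the weights alone come to roughly 7.45 GiB in 32-bit floats and 3.73 GiB in 16-bit floats, which is one reason a GPU with enough memory is the practical choice.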
VoxCPM is a text-to-speech system: software that converts written text into spoken audio. Its main technical distinction is that it skips the usual step of breaking speech into discrete sound tokens. Instead, it generates speech directly as continuous audio representations, using an architecture that combines diffusion models with autoregressive generation. The project claims this approach produces more natural and expressive speech than tokenization-based systems.

The current version, VoxCPM2, is a 2-billion-parameter model trained on more than 2 million hours of multilingual audio across 30 languages. Beyond standard text-to-speech, it supports three additional capabilities: Voice Design (describing a voice in plain text and having the model create it without any reference recording), Controllable Voice Cloning (copying someone's voice from a short audio clip while optionally adjusting the style), and Ultimate Cloning (reproducing every detail of a voice by providing both the reference audio and its transcript).

Output is 48 kHz audio. Installation is via pip, and the model weights are available on Hugging Face. A Python API, a command-line interface, and a web demo are all provided. The model can run in real-time streaming mode and is released under the Apache 2.0 license, which permits commercial use.
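The capabilities described above differ mainly in which inputs the caller supplies. The sketch below makes that mapping explicit. It is not the VoxCPM API: `VoiceRequest`, its field names, `pick_mode`, and the mode strings are all hypothetical and exist only to illustrate the distinction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceRequest:
    """Hypothetical request shape; field names are illustrative only."""
    text: str                                   # content to be spoken
    voice_description: Optional[str] = None     # plain-text voice description
    reference_audio: Optional[str] = None       # path to a short reference clip
    reference_transcript: Optional[str] = None  # transcript of that clip
    style_hint: Optional[str] = None            # optional style adjustment


def pick_mode(req: VoiceRequest) -> str:
    """Map the supplied inputs to one of the capabilities in the text."""
    if req.reference_transcript and not req.reference_audio:
        raise ValueError("a transcript is only useful with reference audio")
    if req.reference_audio and req.reference_transcript:
        return "ultimate_cloning"      # reference audio plus its transcript
    if req.reference_audio:
        return "controllable_cloning"  # reference audio, optional style hint
    if req.voice_description:
        return "voice_design"          # text description, no reference audio
    return "standard_tts"              # text only


if __name__ == "__main__":
    req = VoiceRequest(text="Hello", voice_description="warm, low-pitched narrator")
    print(pick_mode(req))
```

For the real parameter names and call signatures, check the Python API documented in the repository.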
Verify against the repo before relying on details.