Build a voice assistant that speaks naturally in 30 different languages without pre-recorded voice samples.
Create custom synthetic voices by describing them in plain text, then use them in your app without recording audio.
Clone someone's voice from a short audio clip and adjust the speaking style while preserving their unique characteristics.
Generate high-quality 48kHz speech in real time for live applications such as video games or interactive chatbots.
Requires downloading large pre-trained model weights (2 billion parameters) from Hugging Face; a GPU is recommended for reasonable inference speed.
VoxCPM is a text-to-speech (TTS) system: software that converts written text into spoken audio. Its main technical distinction is that it skips the usual step of breaking speech into discrete sound tokens. Instead, it generates speech directly as continuous audio representations, using an architecture that combines diffusion models with autoregressive generation. The project claims this approach produces more natural and expressive speech than tokenization-based systems. The current version, VoxCPM2, is a 2-billion-parameter model trained on over 2 million hours of multilingual audio data across 30 languages. Beyond standard text-to-speech, it supports three additional capabilities: Voice Design (describing a voice in plain text and having the model create it without any reference recording), Controllable Voice Cloning (copying someone's voice from a short audio clip while optionally adjusting the style), and Ultimate Cloning (aiming to reproduce a voice as closely as possible when given both the reference audio and its transcript). Output is 48kHz audio. Installation is via pip, and the model weights are available on Hugging Face. A Python API, a command-line interface, and a web demo are all provided. The model can run in real-time streaming mode and is released under the Apache 2.0 license, which permits commercial use.
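To make "48kHz" and "real-time" concrete, here is a minimal sketch of the arithmetic involved. It does not call VoxCPM. The function names are hypothetical. The real-time factor (RTF) is a common way to judge streaming speech synthesis, not a figure taken from the project. The 16-bit mono PCM format is an assumption made only for the size calculation.

```python
SAMPLE_RATE_HZ = 48_000  # output sample rate stated in the summary above


def audio_duration_seconds(num_samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Length of a clip in seconds, given its sample count."""
    return num_samples / sample_rate_hz


def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Time spent generating divided by the duration of the audio produced.

    A value below 1.0 means audio is produced faster than it plays back,
    which is the usual requirement for live streaming.
    """
    duration = audio_duration_seconds(num_samples, sample_rate_hz)
    if duration <= 0:
        raise ValueError("num_samples must be positive")
    return synthesis_seconds / duration


def pcm16_size_bytes(num_samples: int, channels: int = 1) -> int:
    """Raw size of a clip, assuming 16-bit (2-byte) PCM samples."""
    return num_samples * channels * 2


if __name__ == "__main__":
    # Illustrative numbers only: 1 second of audio generated in 0.5 seconds.
    samples = SAMPLE_RATE_HZ
    print(real_time_factor(0.5, samples))   # 0.5
    print(pcm16_size_bytes(samples))        # 96000
```

To check a real setup, time an actual generation call and pass the elapsed seconds and the number of samples returned. An RTF comfortably below 1.0 on the target hardware suggests live use is feasible.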
Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.