explaingit

resemble-ai/chatterbox

24,757 | Python | Audience · developer | Complexity · 2/5 | Active | License | Setup · moderate

TLDR

Open-source text-to-speech models that convert written text into realistic spoken audio, with support for voice cloning and 23+ languages.

Mindmap

mindmap
  root((Chatterbox))
    What it does
      Text to speech
      Voice cloning
      Multiple languages
    Models
      Turbo fast
      Multilingual 23 langs
      Paralinguistic tags
    Features
      Zero-shot cloning
      Expressiveness control
      Audio watermarking
    Use cases
      Voice agents
      Audiobooks
      Localization
      Interactive media
    Tech stack
      Python
      PyTorch

Things people build with this

USE CASE 1

Build voice agents that speak naturally with emotional expression and laughter.

USE CASE 2

Create audiobooks in multiple languages without hiring voice actors.

USE CASE 3

Localize apps and games into 23+ languages with realistic AI voices.

USE CASE 4

Generate character voices for interactive media by cloning a short audio sample.

Tech stack

Python, PyTorch

Getting it running

Difficulty · moderate | Time to first run · 30 min

PyTorch installation and model downloading can take time depending on internet speed and GPU availability.
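
Because first-run time depends on GPU availability, it can help to check which device PyTorch can see before loading a model. The following is a minimal sketch, not part of the project: `pick_device` is a hypothetical helper name, and falling back to `"cpu"` when PyTorch is missing is a choice made here for illustration.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports.

    Hypothetical helper for illustration; falls back to "cpu" if PyTorch
    is not installed yet.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no accelerator can be used.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # Apple Silicon backend; guarded because older PyTorch builds lack it.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print("Device PyTorch would use:", pick_device())
```

Running on CPU should still work for a first test, but expect generation to be noticeably slower than on a GPU.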

Use freely for any purpose including commercial, as long as you keep the copyright notice.

In plain English

Chatterbox is a family of open-source text-to-speech (TTS) models: software that converts written text into realistic spoken audio. It is built by Resemble AI, which positions it as its state-of-the-art open-source offering.

The library includes three models. Chatterbox-Turbo is the fastest and most efficient, built on a 350-million-parameter neural network. It supports paralinguistic tags, special markers in the text such as [laugh] or [cough] that make the generated speech sound more natural and human. Chatterbox handles English and offers creative controls such as adjusting the expressiveness of the voice. Chatterbox-Multilingual supports 23+ languages, including French, Chinese, Japanese, and Arabic.

All three models support zero-shot voice cloning: you provide a short audio clip of a real person speaking, and the model generates new speech that sounds like that person, with no additional training required.

You would use Chatterbox when you need AI-generated voices for voice agents, audiobooks, localization, interactive media, or any other application that turns text into speech. A built-in watermarking system adds imperceptible neural markers to all generated audio, which helps identify it as AI-generated.

The tech stack is Python, with PyTorch handling the underlying neural network computations.
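
To make the tagging and cloning ideas concrete, here is a short sketch in Python. The tag helper is plain Python and only finds bracketed markers such as [laugh]; it does not know which tags a given model accepts. The `synthesize` function assumes the import path `chatterbox.tts.ChatterboxTTS`, the `from_pretrained(device=...)` and `generate(text, audio_prompt_path=...)` calls, and the `model.sr` sample-rate attribute, as recalled from the project's README. Treat those names as assumptions and verify them against the repo. `find_tags` and `synthesize` are hypothetical helper names.

```python
import re

# Matches lowercase bracketed markers such as [laugh] or [cough].
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text):
    """Return the paralinguistic tag names found in `text`, in order."""
    return TAG_PATTERN.findall(text)


def synthesize(text, out_path, voice_sample=None, device="cpu"):
    """Generate speech for `text` and save it as a WAV file.

    If `voice_sample` is a path to a short reference clip, it is passed
    as the cloning prompt. API names are assumed from the README.
    """
    # Imported lazily so find_tags works without the model installed.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS  # assumed import path

    model = ChatterboxTTS.from_pretrained(device=device)

    # Only pass the cloning prompt when a reference clip was given.
    kwargs = {"audio_prompt_path": voice_sample} if voice_sample else {}
    wav = model.generate(text, **kwargs)

    torchaudio.save(out_path, wav, model.sr)
    return out_path


if __name__ == "__main__":
    line = "That went better than expected [laugh] excuse me [cough]."
    print(find_tags(line))
```

The text above attributes tag support to Chatterbox-Turbo, so a tagged line would need that variant; the sketch loads the base English model only because its interface is the one most likely to match these assumed names.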

Copy-paste prompts

Prompt 1
How do I use Chatterbox to clone a voice from a 10-second audio clip and generate new speech in that voice?
Prompt 2
Show me how to add paralinguistic tags like [laugh] and [cough] to make Chatterbox-generated speech sound more natural.
Prompt 3
How do I set up Chatterbox to generate speech in French, Japanese, and Arabic for a multilingual app?
Prompt 4
What's the difference between Chatterbox-Turbo and Chatterbox-Multilingual, and which should I use for my voice agent?
Prompt 5
How does the watermarking system in Chatterbox work, and how can I verify if audio is AI-generated?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.