explaingit

sesameailabs/csm

Analysis updated 2026-06-24

14,621 stars | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

TLDR

Sesame's Conversational Speech Model, a 1B Llama-style backbone plus Mimi audio decoder that generates speech from text and audio context, runnable locally on a CUDA GPU.

Mindmap

mindmap
  root((csm))
    Inputs
      Text prompt
      Speaker ID
      Audio context segments
    Outputs
      Generated speech
      Mimi audio codes
      Wav files
    Use Cases
      Research speech synthesis
      Generate dialogue audio
      Prototype voice agents
    Tech Stack
      Python
      PyTorch
      CUDA
      HuggingFace
      Mimi

What do people build with it?

USE CASE 1

Generate dialogue audio between two speakers from text on a local CUDA GPU

USE CASE 2

Provide past audio segments as context so a new utterance matches a conversation

USE CASE 3

Experiment with a Llama-backbone speech model for research on voice generation

USE CASE 4

Try the model in a browser via the hosted Hugging Face space without local setup

What is it built with?

Python, PyTorch, CUDA, HuggingFace, Mimi

How does it compare?

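| | sesameailabs/csm | microsoft/playwright-python | nltk/nltk |
| --- | --- | --- | --- |
| Stars | 14,621 | 14,629 | 14,613 |
| Language | Python | Python | Python |
| Setup difficulty | hard | moderate | easy |
| Complexity | 5/5 | 3/5 | 3/5 |
| Audience | researcher | developer | researcher |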

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Needs a CUDA GPU and Hugging Face access to two gated models (Llama-3.2-1B and CSM-1B); ffmpeg may be needed for some audio steps. On Windows, you must replace triton with triton-windows.
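The requirements above can be checked before installing anything heavy. The following is a minimal sketch, not part of the repo: the function name `preflight` and the result keys are made up for illustration, and it only reports what it finds rather than fixing anything. It assumes the Windows package is distributed under the name `triton-windows`, as the setup note says.

```python
import importlib.metadata
import importlib.util
import shutil
import sys


def preflight() -> dict:
    """Report, as booleans, whether the listed prerequisites look satisfied."""
    checks = {
        # Python 3.10 is the recommended version, not a hard requirement.
        "python_3_10_recommended": sys.version_info[:2] == (3, 10),
        # ffmpeg may be needed for some audio steps.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if checks["torch_installed"]:
        try:
            import torch

            checks["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken torch install counts as "no usable CUDA".
            checks["cuda_available"] = False
    if sys.platform == "win32":
        # On Windows the setup note asks for triton-windows instead of triton.
        try:
            importlib.metadata.version("triton-windows")
            checks["triton_windows_installed"] = True
        except importlib.metadata.PackageNotFoundError:
            checks["triton_windows_installed"] = False
    return checks
```

Calling `preflight()` and printing the result gives a quick pass or fail view per item. Access to the two gated Hugging Face models is not checked here and still has to be requested and confirmed by hand.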

In plain English

CSM stands for Conversational Speech Model. It is an AI model from a company called Sesame that turns text and audio inputs into generated speech. The README explains that internally CSM uses a Llama-style language model as its backbone and a smaller audio decoder that produces audio codes in the Mimi format, a separate audio compression format published by another team called Kyutai. A fine-tuned version of the same model powers the interactive voice demo on Sesame's website and is the subject of their research blog post about crossing what they call the uncanny valley of voice.

A news section at the top notes two milestones. In March 2025 the 1B-parameter variant was released and uploaded to Hugging Face. In May 2025, version 4.52.1 of the Hugging Face Transformers library added native support for CSM, which means users no longer need this repository to run the model from Python. A hosted Hugging Face space is also available for trying it in a browser without any local setup.

To run the code in this repository, the README lists a CUDA-compatible GPU as required, tested on CUDA 12.4 and 12.6. Python 3.10 is recommended, ffmpeg may be needed for some audio steps, and the user has to be granted access on Hugging Face to both Llama-3.2-1B and CSM-1B, since both are gated models. The setup commands clone the repo, create a virtual environment, install requirements, set an environment variable to turn off lazy compilation in the Mimi decoder, and log into Hugging Face. Windows users are told to swap the standard triton package for triton-windows.

The Quickstart is a single command, python run_csm.py, which generates a conversation between two characters using built-in prompts. The Usage section then shows two Python snippets for writing your own programs. The first generates one short sentence with a random speaker identity by calling load_csm_1b and generator.generate with the text and a speaker ID.
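A minimal sketch of that first snippet, assuming it is run from inside a checkout of the repo so that `generator.py` is importable. Only `load_csm_1b`, `generate`, the text, and the speaker ID come from the summary above; the `device`, `context`, and `max_audio_length_ms` keywords and the `sample_rate` attribute are written from memory of the README and should be verified against the repo. The helpers `pick_device` and `synthesize` are hypothetical names introduced here.

```python
def pick_device(cuda_available: bool) -> str:
    """Use CUDA when present; the README lists a CUDA GPU as required,
    so the CPU fallback is only a last resort and may be impractically slow."""
    return "cuda" if cuda_available else "cpu"


def synthesize(text: str, speaker: int = 0, out_path: str = "audio.wav") -> str:
    """Generate one utterance with no context and save it as a wav file."""
    # Heavy imports stay inside the function so this file can be imported
    # on a machine without torch or the csm checkout.
    import torch
    import torchaudio
    from generator import load_csm_1b  # generator.py from the csm repo

    gen = load_csm_1b(device=pick_device(torch.cuda.is_available()))  # assumed keyword
    audio = gen.generate(
        text=text,
        speaker=speaker,
        context=[],                  # assumed keyword: no prior segments
        max_audio_length_ms=10_000,  # assumed keyword: cap output at 10 s
    )
    # torchaudio.save expects a (channels, samples) tensor on the CPU.
    torchaudio.save(out_path, audio.unsqueeze(0).cpu(), gen.sample_rate)
    return out_path
```

With the model access and login steps completed, `synthesize("Hello from Sesame.", speaker=0)` would write `audio.wav` to the current directory. Because no context is passed, the voice is whatever the base model happens to produce for that speaker ID.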
The second shows how to give the model context: a list of past Segments, each with text, speaker number, and audio, is passed in so the new utterance sounds like it belongs in the same conversation.

The FAQ is candid. The released model is a base model with no pre-trained specific voices, cannot generate text and is not meant to chat, and only partly handles non-English languages because of incidental training data. A Misuse section prohibits impersonation, misinformation, and any illegal use of the model.
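The context mechanism from the second Usage snippet can be sketched as follows. This is an illustration under assumptions, not the README's code: `build_context`, `make_loader`, and `continue_conversation` are hypothetical helpers, the `Segment(text=..., speaker=..., audio=...)` constructor follows the three fields named above, and the `context` and `max_audio_length_ms` keywords and the `sample_rate` attribute should be checked against the repo.

```python
from typing import Any, Callable, Iterable, List, Tuple

# One past turn: (speaker number, transcript, path to its audio file).
Turn = Tuple[int, str, str]


def build_context(
    turns: Iterable[Turn],
    segment_cls: Callable[..., Any],
    load_audio: Callable[[str], Any],
) -> List[Any]:
    """Turn past utterances into segment objects, preserving their order."""
    return [
        segment_cls(text=text, speaker=speaker, audio=load_audio(path))
        for speaker, text, path in turns
    ]


def make_loader(target_sample_rate: int) -> Callable[[str], Any]:
    """Return a loader that reads a file and resamples it to the model's rate."""
    import torchaudio

    def load(path: str):
        waveform, sample_rate = torchaudio.load(path)
        # Assumes mono input: drop the channel dimension before resampling.
        return torchaudio.functional.resample(
            waveform.squeeze(0), orig_freq=sample_rate, new_freq=target_sample_rate
        )

    return load


def continue_conversation(gen: Any, turns: Iterable[Turn], text: str, speaker: int):
    """Generate a new utterance conditioned on the earlier turns."""
    from generator import Segment  # generator.py from the csm repo

    context = build_context(turns, Segment, make_loader(gen.sample_rate))
    return gen.generate(
        text=text, speaker=speaker, context=context, max_audio_length_ms=10_000
    )
```

Passing the segment class and the audio loader in as arguments keeps `build_context` free of model dependencies, so the ordering and field mapping can be checked without a GPU. Here `gen` stands for the object returned by `load_csm_1b`.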

Copy-paste prompts

Prompt 1
Give me a 5-minute setup guide for SesameAILabs csm on a CUDA 12.4 box including the Hugging Face login step
Prompt 2
Write a Python script that uses load_csm_1b and generator.generate to produce one sentence with a chosen speaker ID
Prompt 3
Show me how to pass past Segments as context so a generated utterance sounds like the same conversation
Prompt 4
Adapt csm to run on Windows including swapping triton for triton-windows and any other gotchas
Prompt 5
Explain how the Llama backbone and the Mimi decoder fit together in csm and what the speaker ID controls

Frequently asked questions

What is csm?

Sesame's Conversational Speech Model, a 1B Llama-style backbone plus Mimi audio decoder that generates speech from text and audio context, runnable locally on a CUDA GPU.

What language is csm written in?

Mainly Python. The stack also includes PyTorch, CUDA, HuggingFace, and Mimi.

How hard is csm to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is csm for?

Mainly researchers.

Verify against the repo before relying on details.