explaingit

sesameailabs/csm

14,621 · Python

TLDR

CSM (Conversational Speech Model) is Sesame's AI model for generating speech from text and audio inputs.

In plain English

CSM stands for Conversational Speech Model. It is an AI model from a company called Sesame that generates speech from text and audio inputs. According to the README, CSM uses a Llama-style language model as its backbone and a smaller audio decoder that produces audio codes in the Mimi format, a separate audio compression format published by another team, Kyutai. A fine-tuned version of the same model powers the interactive voice demo on Sesame's website and is the subject of their research blog post about crossing what they call the uncanny valley of voice.

A news section at the top of the README notes two milestones. In March 2025 the 1B-parameter variant was released and uploaded to Hugging Face. In May 2025, version 4.52.1 of the Hugging Face Transformers library added native support for CSM, so users no longer need this repository to run the model from Python. A hosted Hugging Face space is also available for trying the model in a browser without any local setup.

To run the code in this repository, the README lists a CUDA-compatible GPU as a requirement; the code was tested on CUDA 12.4 and 12.6. Python 3.10 is recommended, and ffmpeg may be needed for some audio operations. Because Llama-3.2-1B and CSM-1B are both gated models, the user must be granted access to each on Hugging Face. The setup commands clone the repo, create a virtual environment, install the requirements, set an environment variable that turns off lazy compilation in the Mimi decoder, and log in to Hugging Face. Windows users are told to swap the standard triton package for triton-windows.

The Quickstart is a single command, python run_csm.py, which generates a conversation between two characters using built-in prompts. The Usage section then shows two Python snippets for writing your own programs. The first calls load_csm_1b to load the model and generator.generate with the text and a speaker ID; because no context is supplied, the resulting sentence is spoken with a random speaker identity. The second shows how to give the model context: a list of past Segments, each with text, a speaker number, and audio, is passed in so that the new utterance sounds like it belongs to the same conversation.

The FAQ is candid. The released model is a base model that has not been fine-tuned on any specific voice. It cannot generate text and is not meant for chat. It handles non-English languages only partially, and only because some non-English data happened to end up in the training set. A Misuse section prohibits impersonation, misinformation, and any illegal use of the model.
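The context-passing pattern from the Usage section can be sketched as follows. This is a minimal illustration rather than the repository's code: the Segment dataclass below is a stand-in for the repo's own class, EchoGenerator is a hypothetical test double that lets the sketch run without a GPU or model weights, and the keyword names text, speaker, and context are assumptions that should be checked against the repo.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class Segment:
    """Illustrative stand-in for the repo's Segment: one past utterance."""
    text: str
    speaker: int
    audio: Sequence[float]  # the real model expects an audio tensor here


def speak_in_context(generator: Any, text: str, speaker: int,
                     history: List[Segment]) -> Any:
    """Generate `text` as `speaker`, conditioned on earlier turns.

    An empty history mirrors the first Usage snippet (no context);
    a non-empty history mirrors the second.
    """
    # Keyword names are assumed, not confirmed; verify against the repo.
    return generator.generate(text=text, speaker=speaker, context=list(history))


class EchoGenerator:
    """Hypothetical test double: reports what it was asked to generate."""

    def generate(self, text: str, speaker: int, context: List[Segment]) -> dict:
        return {"text": text, "speaker": speaker, "context_turns": len(context)}


# Two earlier turns from two different speakers (placeholder audio samples).
history = [
    Segment(text="Hey, how are you doing?", speaker=0, audio=[0.0, 0.1]),
    Segment(text="Pretty good, pretty good.", speaker=1, audio=[0.2, 0.0]),
]

result = speak_in_context(EchoGenerator(), "Glad to hear it.", 0, history)
print(result)
```

With the actual repository, the object returned by load_csm_1b would take the place of EchoGenerator, and each Segment would carry real audio loaded as a tensor at the sample rate the model expects.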

Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.