explaingit

kyutai-labs/moshi

10,198 | Python | Audience · researcher | Complexity · 5/5 | License | Setup · hard

TLDR

An open AI model for real-time, full-duplex voice conversation: both sides can speak at once, as on a phone call. It ships with implementations for research, Apple Silicon devices, and production deployments.

Mindmap

mindmap
  root((repo))
    What it does
      Full-duplex voice chat
      Real-time conversation
      Inner monologue text
    Architecture
      7B parameter model
      Mimi audio codec
      Dual audio streams
      80ms latency chunks
    Implementations
      PyTorch research
      MLX Apple Silicon
      Rust production
    Requirements
      24GB GPU for PyTorch
      Apple Silicon for MLX
      CC-BY 4.0 license

Things people build with this

USE CASE 1

Build a real-time voice assistant that can be interrupted mid-sentence and respond naturally without waiting for a turn.

USE CASE 2

Use the Mimi audio codec standalone to compress and stream 24 kHz audio with under 100ms latency in your own app.

USE CASE 3

Run a local voice conversation AI on an Apple Silicon MacBook using the MLX implementation without a cloud API.

USE CASE 4

Deploy a production voice conversation server using the Rust implementation for performance and reliability.

Tech stack

Python · PyTorch · MLX · Rust

Getting it running

Difficulty · hard | Time to first run · 1h+

The PyTorch version requires a GPU with at least 24 GB of VRAM. On Apple Silicon, use the MLX version instead.

CC-BY 4.0: you can use and build on the model weights freely, including for commercial purposes, as long as you credit Kyutai.
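The hardware guidance in this section can be written down as a small decision rule. The sketch below is illustrative only: `pick_implementation` is a hypothetical helper, not part of the Moshi codebase, and the only figure it relies on is the 24 GB threshold quoted above. It assumes the caller already knows the GPU memory size, for example from the vendor's tooling.

```python
import platform
from typing import Optional

# Threshold quoted in this summary for the PyTorch version.
MIN_PYTORCH_VRAM_GB = 24


def pick_implementation(system: str, machine: str,
                        vram_gb: Optional[float] = None) -> str:
    """Hypothetical helper: map a machine description to an implementation.

    system/machine are the values returned by platform.system() and
    platform.machine(); vram_gb is the GPU memory in gigabytes, if known.
    """
    # Apple Silicon Macs report "Darwin" and "arm64".
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    # Otherwise the PyTorch version needs a sufficiently large GPU.
    if vram_gb is not None and vram_gb >= MIN_PYTORCH_VRAM_GB:
        return "pytorch"
    return "unsupported"


def pick_for_this_machine(vram_gb: Optional[float] = None) -> str:
    """Apply the same rule to the machine running this script."""
    return pick_implementation(platform.system(), platform.machine(), vram_gb)
```

Calling `pick_for_this_machine(vram_gb=24)` on a Linux workstation with a 24 GB GPU would return `"pytorch"`. The rule deliberately ignores the Rust implementation, whose hardware requirements are not stated in this summary.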

In plain English

Moshi is an AI model designed for real-time spoken conversation. Unlike most voice assistants that work in a turn-taking mode, Moshi is full-duplex, meaning both sides can speak at the same time, similar to a natural phone call. It was built by Kyutai, a French AI research lab, and the model weights are released openly under a CC-BY 4.0 license.

The system works by processing two audio streams simultaneously: one representing what the AI is saying and one representing what the user is saying. Alongside the audio, Moshi also generates text tokens for its own speech as part of an internal reasoning process, which the README calls an "inner monologue." This internal text prediction is described as improving the quality of what the model says. The architecture uses two neural networks working together: a smaller one that handles very short-term audio dependencies and a larger 7-billion-parameter model that handles longer-term patterns across time.

A key component is Mimi, a streaming audio codec that compresses 24 kHz audio into a very compact form. Mimi is designed to work in real time with only an 80 ms delay per chunk of audio. It is also released separately so developers can use it for other audio tasks.

The repository contains three implementation versions. The PyTorch version is aimed at researchers who want to experiment with the code. The MLX version is optimized for running locally on Apple Silicon devices like MacBooks and iPhones. The Rust version is built for production deployments where reliability and performance matter most.

You can try the model live at the demo site listed in the README. To run it yourself, a GPU with at least 24 gigabytes of memory is required for the PyTorch version.
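The two numbers quoted for Mimi (24 kHz audio, 80 ms chunks) fix the size of each chunk by simple arithmetic: 24,000 samples per second times 0.08 seconds is 1,920 samples, or 12.5 chunks per second. The sketch below works through that arithmetic and shows one way to slice a buffer into chunks of that size. It is plain Python for illustration and does not call the Mimi codec; the function names and the zero-padding of the final chunk are assumptions, not behavior taken from the repository.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE_HZ = 24_000  # sample rate quoted in the summary
FRAME_MS = 80            # chunk duration quoted in the summary


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      frame_ms: int = FRAME_MS) -> int:
    """Number of audio samples in one chunk."""
    return sample_rate_hz * frame_ms // 1000


def frames_per_second(frame_ms: int = FRAME_MS) -> float:
    """How many chunks cover one second of audio."""
    return 1000 / frame_ms


def iter_frames(samples: Sequence[float],
                frame_len: int) -> Iterator[List[float]]:
    """Yield fixed-size chunks, zero-padding the last one (an assumption)."""
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame.extend([0.0] * (frame_len - len(frame)))
        yield frame


if __name__ == "__main__":
    one_second = [0.0] * SAMPLE_RATE_HZ
    frames = list(iter_frames(one_second, samples_per_frame()))
    print(samples_per_frame(), frames_per_second(), len(frames))
```

One second of audio does not divide evenly into 80 ms chunks, so the example produces 13 chunks, the last of which is half padding. A real streaming pipeline would more likely hold the remainder until the next samples arrive.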

Copy-paste prompts

Prompt 1
Set up the Moshi PyTorch version on a machine with a 24GB GPU and start a real-time voice conversation session.
Prompt 2
How do I use the Mimi audio codec from kyutai-labs/moshi to stream and compress audio in my Python app with under 100ms latency?
Prompt 3
I want to run Moshi on an Apple Silicon MacBook using the MLX build. What dependencies do I need, and how do I start a session?
Prompt 4
Using the Moshi Rust implementation, deploy a production voice conversation server and connect a client app to it.
Prompt 5
How does Moshi's inner monologue work, and can I log or inspect the text tokens it generates during a live conversation?

Verify against the repo before relying on details.