Build a real-time voice assistant that can be interrupted mid-sentence and respond naturally without waiting for a turn.
Use the Mimi audio codec standalone to compress and stream 24 kHz audio with under 100ms latency in your own app.
Run a local voice conversation AI on an Apple Silicon MacBook using the MLX implementation without a cloud API.
Deploy a production voice conversation server using the Rust implementation for performance and reliability.
The PyTorch version requires a GPU with at least 24 GB of VRAM; on Apple Silicon, use the MLX version instead.
Moshi is an AI model designed for real-time spoken conversation. Unlike most voice assistants, which work in strict turns, Moshi is full-duplex: both sides can speak at the same time, as in a natural phone call. It was built by Kyutai, a French AI research lab, and the model weights are released openly under a CC-BY 4.0 license.

The system processes two audio streams simultaneously: one for what the AI is saying and one for what the user is saying. Alongside the audio, Moshi also predicts text tokens for its own speech, which the README calls an "inner monologue". This internal text prediction is described as improving the quality of what the model says. The architecture uses two neural networks working together: a smaller one that models dependencies within a single time step of audio, and a larger 7-billion-parameter model that models patterns across time.

A key component is Mimi, a streaming audio codec that compresses 24 kHz audio into a very compact form. Mimi is designed to run in real time, adding only 80 ms of delay per chunk of audio. It is also released separately so developers can use it for other audio tasks.

The repository contains three implementations. The PyTorch version is aimed at researchers who want to experiment with the code. The MLX version is optimized for running locally on Apple Silicon devices such as MacBooks and iPhones. The Rust version is built for production deployments where reliability and performance matter most.

You can try the model live at the demo site listed in the README. To run the PyTorch version yourself, you need a GPU with at least 24 GB of memory.
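To make the Mimi figures concrete, here is a minimal sketch of the chunking arithmetic implied by 24 kHz audio and an 80 ms chunk. It does not use the Mimi codec or any Moshi API. The helper names (`samples_per_frame`, `frames_per_second`, `split_into_frames`) are hypothetical and exist only for this illustration.

```python
# Illustrative arithmetic only: this is not the Mimi API.
# The two constants come from the description above; everything else is assumed.
SAMPLE_RATE_HZ = 24_000  # Mimi operates on 24 kHz audio
FRAME_MS = 80            # one chunk of audio covers 80 ms


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      frame_ms: int = FRAME_MS) -> int:
    """Number of raw audio samples contained in one chunk."""
    return sample_rate_hz * frame_ms // 1000


def frames_per_second(frame_ms: int = FRAME_MS) -> float:
    """How many chunks a streaming loop handles per second of audio."""
    return 1000 / frame_ms


def split_into_frames(samples, frame_len):
    """Yield consecutive full frames; a trailing partial frame is held back."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]


if __name__ == "__main__":
    frame_len = samples_per_frame()
    one_second = [0.0] * SAMPLE_RATE_HZ  # one second of silence
    frames = list(split_into_frames(one_second, frame_len))
    print(frame_len, frames_per_second(), len(frames))  # 1920 12.5 12
```

Under these assumptions, a streaming loop has to buffer 1920 samples before it can hand a chunk to the codec, which is where the 80 ms floor on latency comes from. One second of audio yields 12 full chunks, with half a chunk left waiting for more input.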
Verify against the repo before relying on details.