Build a podcast generator that creates multi-speaker conversations from scripts without hiring voice actors.
Create interactive voice apps where characters have distinct, cloneable voices that respond naturally to user input.
Generate dialogue-heavy content like audiobook chapters, radio dramas, or interview simulations with realistic speaker variation.
Prototype voice-based products with emotion and tone control by conditioning the model on audio prompts.
Requires CUDA/GPU setup, large model downloads from Hugging Face, and PyTorch compilation.
Dia is an open-weight text-to-speech AI model built by Nari Labs that specializes in generating realistic multi-speaker dialogue from a written script. Unlike typical text-to-speech tools that synthesize a single narrator voice, Dia is designed to produce back-and-forth conversations between two distinct speakers, complete with natural nonverbal sounds like laughter, coughing, sighing, and gasping. You give it a script with speaker tags like [S1] and [S2], and it outputs audio that sounds like a real two-person conversation. It also supports voice cloning: provide a short audio sample and Dia will match that voice's tone and style. Emotion and tone can be steered the same way, by conditioning the model on an audio prompt. You'd use Dia if you're building a podcast generator, dialogue-based content, interactive voice apps, or any product that needs lifelike multi-speaker audio without expensive voice actors. The model has 1.6 billion parameters, runs on NVIDIA GPUs, currently supports English only, and is available through Hugging Face Transformers. The tech stack is Python, with PyTorch and CUDA required for inference.
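To make the script format concrete, here is a minimal sketch of assembling a two-speaker script with [S1] and [S2] tags before handing it to the model. The helper name `build_script` is hypothetical and not part of Dia. Writing nonverbal cues in parentheses, such as `(laughs)`, is an assumption about the expected syntax and should be checked against the repo. The model call itself is deliberately left out, so this runs without a GPU.

```python
from typing import Iterable, Tuple

# Speaker tags as described in the summary above; Dia-style scripts use two speakers.
SPEAKER_TAGS = ("[S1]", "[S2]")


def build_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_number, text) turns into one tagged script string.

    Hypothetical helper for illustration only; not part of the Dia package.
    """
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            raise ValueError(f"expected speaker 1 or 2, got {speaker}")
        cleaned = " ".join(text.split())  # collapse stray whitespace and newlines
        if not cleaned:
            continue  # skip empty turns rather than emitting a bare tag
        parts.append(f"{SPEAKER_TAGS[speaker - 1]} {cleaned}")
    return " ".join(parts)


script = build_script([
    (1, "Welcome back to the show."),
    (2, "Thanks for having me. (laughs)"),  # parenthesized cue: assumed syntax
    (1, "Let's get started."),
])
print(script)
```

The resulting string is what you would pass as the text input to the model. Keeping script assembly separate from inference makes it easy to validate or unit-test dialogue content before spending GPU time on generation.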
Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.