explaingit

nari-labs/dia

19,291 | Python | Audience · developer | Complexity · 3/5 | Quiet | License | Setup · hard

TLDR

Open-weight AI model that generates realistic multi-speaker dialogue audio from scripts, with voice cloning and natural sounds like laughter and sighing.

Mindmap

mindmap
  root((Dia))
    What it does
      Multi-speaker dialogue
      Voice cloning
      Natural sounds
      Emotion control
    Tech stack
      Python
      PyTorch
      CUDA
      Hugging Face
    Use cases
      Podcast generation
      Interactive voice apps
      Dialogue content
      Voice acting replacement
    Audience
      Audio developers
      Content creators
      Product builders

Things people build with this

USE CASE 1

Build a podcast generator that creates multi-speaker conversations from scripts without hiring voice actors.

USE CASE 2

Create interactive voice apps where characters have distinct, cloneable voices that respond naturally to user input.

USE CASE 3

Generate dialogue-heavy content like audiobook chapters, radio dramas, or interview simulations with realistic speaker variation.

USE CASE 4

Prototype voice-based products with emotion and tone control by conditioning the model on audio prompts.

Tech stack

Python · PyTorch · CUDA · Hugging Face Transformers

Getting it running

Difficulty · hard Time to first run · 1h+

Requires an NVIDIA GPU with a working CUDA setup, a CUDA-enabled PyTorch install, and a large model download from Hugging Face.
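Before downloading the weights, it can help to confirm that the GPU prerequisites are actually in place. The snippet below is a minimal preflight sketch, not part of Dia itself: the `preflight` function and its report keys are hypothetical names chosen for illustration. It only uses the Python standard library plus `torch.cuda.is_available()`, and it skips the PyTorch check when PyTorch is not installed.

```python
import importlib.util
import shutil


def preflight() -> dict:
    """Report whether the basic GPU prerequisites appear to be present.

    Hypothetical helper for illustration; Dia does not ship this function.
    """
    report = {
        # Is the torch package importable in this environment?
        "torch_installed": importlib.util.find_spec("torch") is not None,
        # Is the NVIDIA driver tool visible on PATH? (a rough driver check)
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # Filled in below only if torch can be imported.
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except (ImportError, OSError):
            # A broken install counts as "CUDA not available" for this check.
            report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If `cuda_available` comes back `no` while `nvidia_smi_on_path` is `yes`, the likely culprit is a CPU-only PyTorch build, which is worth fixing before attempting the model download.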

Use freely for any purpose, including commercial. Keep the license notice and disclose any changes you make; the license also includes a patent grant.

In plain English

Dia is an open-weight text-to-speech AI model built by Nari Labs that specializes in generating realistic multi-speaker dialogue from a written script. Unlike typical text-to-speech tools that synthesize a single narrator voice, Dia is designed to produce back-and-forth conversations between two distinct speakers, complete with natural nonverbal sounds like laughter, coughing, sighing, and gasping. You give it a script with speaker tags like [S1] and [S2], and it outputs audio that sounds like a real two-person conversation.

It also supports voice cloning: you provide a short audio sample, and Dia matches that voice's tone and style. Emotion and tone can be steered by conditioning the model on an audio prompt.

You'd use Dia if you're building a podcast generator, dialogue-based content, interactive voice apps, or any product that needs lifelike multi-speaker audio without expensive voice actors. The model has 1.6 billion parameters, runs on NVIDIA GPUs, supports English only at the moment, and is available through Hugging Face Transformers. The tech stack is Python, with PyTorch and CUDA required for inference.
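To make the script format concrete, here is a small sketch that assembles a two-speaker script using the [S1] and [S2] tags described above and parses it back for a sanity check. The helper names `build_script` and `split_script` are hypothetical, and joining turns on a single line separated by spaces is an assumption; check the repo for the exact formatting Dia expects. The sketch stops at producing the text and does not call the model.

```python
import re


def build_script(turns):
    """Join (speaker, text) turns into one tagged script string.

    `speaker` must be 1 or 2, matching the [S1] and [S2] tags.
    Single-line, space-separated output is an assumption of this sketch.
    """
    parts = []
    for speaker, line in turns:
        if speaker not in (1, 2):
            raise ValueError(f"speaker must be 1 or 2, got {speaker!r}")
        text = " ".join(line.split())  # collapse stray whitespace
        if not text:
            raise ValueError("each turn needs some text")
        parts.append(f"[S{speaker}] {text}")
    return " ".join(parts)


# Matches one tagged turn, stopping before the next tag or the end of input.
_TURN = re.compile(r"\[S([12])\]\s*(.*?)(?=\s*\[S[12]\]|\s*$)", re.S)


def split_script(script):
    """Parse a tagged script back into (speaker, text) turns."""
    return [(int(m.group(1)), m.group(2)) for m in _TURN.finditer(script)]


if __name__ == "__main__":
    script = build_script([
        (1, "Did you finish the episode outline?"),
        (2, "Almost. I still need an intro."),
    ])
    print(script)
    print(split_script(script))
```

Round-tripping through `split_script` is a cheap way to catch a missing or mistyped tag before spending GPU time on generation.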

Copy-paste prompts

Prompt 1
How do I set up Dia to generate a two-speaker conversation from a script with [S1] and [S2] tags?
Prompt 2
Show me how to clone a voice in Dia using a short audio sample and apply it to a dialogue script.
Prompt 3
What's the process for conditioning Dia's output on an emotion or tone using an audio prompt?
Prompt 4
How do I run Dia inference on an NVIDIA GPU using PyTorch and what are the minimum hardware requirements?
Prompt 5
Can you walk me through generating a podcast episode with multiple speakers using Dia's Hugging Face model?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.