explaingit

kat3ri/comfyui-dramabox

15 | Python | Audience · vibe coder | Complexity · 3/5 | Active | Setup · hard

TLDR

ComfyUI custom node that wraps ResembleAI DramaBox, an expressive text-to-speech model with voice cloning, stage directions, laughs, and sighs.

Mindmap

mindmap
  root((ComfyUI-DramaBox))
    Inputs
      Quoted dialogue text
      Stage directions
      Reference voice clip
    Outputs
      Spoken audio waveform
      Compatible with Save Audio node
    Use Cases
      Voice acting prototypes
      Scene narration
      Voice cloning
    Tech Stack
      Python
      ComfyUI
      CUDA
      HuggingFace
      PyTorch

Things people build with this

USE CASE 1

Generate expressive dialogue audio inside a ComfyUI workflow from a written script

USE CASE 2

Clone a voice from a 10-second reference clip and use it in a ComfyUI scene

USE CASE 3

Add laughs, sighs, and emotion cues to TTS output for short animation or game prototypes

Tech stack

Python, ComfyUI, CUDA, PyTorch, HuggingFace

Getting it running

Difficulty · hard | Time to first run · 1h+

Needs an NVIDIA GPU with about 24 GB of VRAM, CUDA 12 or newer, and about 17 GB of free disk space for model weights. The first run downloads the weights from HuggingFace.
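The requirements above can be checked before installing. The following is a minimal sketch, not part of the repository: the function names are hypothetical, the thresholds are the approximate figures quoted in this summary, and sizes are measured in decimal gigabytes (10^9 bytes) so that a card sold as 24 GB is not rejected for reporting slightly less than 24 GiB. PyTorch is treated as optional; without it the GPU check simply reports that no CUDA GPU was detected.

```python
import shutil

# Approximate figures taken from this summary, not from the node's code.
REQUIRED_VRAM_GB = 24
REQUIRED_DISK_GB = 17


def check_requirements(vram_gb, free_disk_gb):
    """Return a list of problems; an empty list means the machine looks OK."""
    problems = []
    if vram_gb is None:
        problems.append("no CUDA GPU detected")
    elif vram_gb < REQUIRED_VRAM_GB:
        problems.append(
            f"GPU has {vram_gb:.1f} GB VRAM, about {REQUIRED_VRAM_GB} GB needed"
        )
    if free_disk_gb < REQUIRED_DISK_GB:
        problems.append(
            f"{free_disk_gb:.1f} GB disk free, about {REQUIRED_DISK_GB} GB needed"
        )
    return problems


def detect_vram_gb():
    """Total memory of GPU 0 in decimal GB, or None without PyTorch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1e9


def detect_free_disk_gb(path="."):
    """Free space in decimal GB on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    issues = check_requirements(detect_vram_gb(), detect_free_disk_gb())
    print("Looks OK" if not issues else "\n".join(issues))
```

Run it from the drive where ComfyUI stores its models, since that is where the roughly 17 GB of weights would land. The CUDA version is not checked here.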

In plain English

ComfyUI-DramaBox is a small add-on for ComfyUI, the popular browser-based workflow tool used for image and audio generation. It wraps a text-to-speech model called DramaBox, made by ResembleAI, so that ComfyUI users can generate spoken audio from typed scene descriptions without leaving the app.

The DramaBox model itself is described as expressive: it does not just read text in a flat voice, it can produce laughs, sighs, pauses, voice cracks, and other dramatic moments based on cues in the prompt. It also supports voice cloning, meaning if you upload a short reference clip of about ten seconds, the generated speech will try to match that speaker's voice. The output comes out as standard ComfyUI audio that can be sent to the Preview Audio or Save Audio nodes already in ComfyUI.

The hardware bar is high. You need an NVIDIA GPU with about 24 GB of video memory, CUDA 12 or newer, and about 17 GB of free disk space for the model files. Installation is either through the ComfyUI Manager by searching for the node name, or by cloning the repository into the custom_nodes folder and running pip install. The first time you generate audio, the node downloads the model weights from HuggingFace by itself: a transformer file, an audio components file, and a 4-bit version of a Gemma 3 text encoder.

The prompt format is unusual. Anything inside quotes is what the model speaks aloud, including phonetic laughs like Hahaha. Anything outside quotes is treated as stage direction that shapes how the next line is delivered, such as She sighs deeply or His voice rises with fury.

The README warns that the first generation takes several minutes while models load into memory, but later runs on an H100 GPU take only about two or three seconds.
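To make the quoting rule concrete, here is a small illustrative sketch that builds a prompt from (stage direction, spoken line) pairs and splits one back into its parts. It is not DramaBox's own parser: `build_prompt` and `split_prompt` are hypothetical helpers, and the sketch assumes straight double quotes that are balanced and not nested.

```python
def build_prompt(beats):
    """Join (direction, speech) pairs into one prompt string.

    The direction stays outside quotes; the speech is wrapped in quotes.
    An empty direction is skipped.
    """
    parts = []
    for direction, speech in beats:
        if direction:
            parts.append(direction)
        parts.append(f'"{speech}"')
    return " ".join(parts)


def split_prompt(prompt):
    """Split a prompt into ("direction", text) and ("speech", text) segments.

    Splitting on the quote character puts quoted text at odd indexes,
    which holds only when the quotes are balanced.
    """
    segments = []
    for index, chunk in enumerate(prompt.split('"')):
        text = chunk.strip()
        if not text:
            continue
        kind = "speech" if index % 2 == 1 else "direction"
        segments.append((kind, text))
    return segments


scene = build_prompt([
    ("She sighs deeply.", "I thought you had left."),
    ("His voice rises with fury.", "Hahaha, you wish!"),
])
print(scene)
for kind, text in split_prompt(scene):
    print(f"{kind:>9}: {text}")
```

Splitting a draft this way is a quick check that every line meant to be spoken actually sits inside quotes before the text is pasted into the node.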

Copy-paste prompts

Prompt 1
Install ComfyUI-DramaBox on my ComfyUI setup and route its output to the Save Audio node
Prompt 2
Write a DramaBox prompt for a 3-line argument scene using stage directions and quoted dialogue
Prompt 3
Build a ComfyUI workflow that pairs ComfyUI-DramaBox voice output with a video generator on the same scene text
Prompt 4
Lower the VRAM use of ComfyUI-DramaBox below 24GB by enabling offloading or quantization where possible

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.