Generate expressive dialogue audio inside a ComfyUI workflow from a written script
Clone a voice from a 10-second reference clip and use it in a ComfyUI scene
Add laughs, sighs, and emotion cues to TTS output for short animation or game prototypes
Requires an NVIDIA GPU with about 24 GB of video memory, CUDA 12 or newer, and about 17 GB of disk space for model weights. The first run downloads the weights from HuggingFace.
ComfyUI-DramaBox is a small add-on for ComfyUI, the popular browser-based workflow tool used for image and audio generation. It wraps DramaBox, a text-to-speech model made by ResembleAI, so that ComfyUI users can generate spoken audio from typed scene descriptions without leaving the app. The model is described as expressive: rather than reading text in a flat voice, it can produce laughs, sighs, pauses, voice cracks, and other dramatic moments based on cues in the prompt. It also supports voice cloning: given a short reference clip of about ten seconds, the generated speech tries to match that speaker's voice. The output is standard ComfyUI audio that can be sent to the built-in Preview Audio or Save Audio nodes.
The hardware bar is high. You need an NVIDIA GPU with about 24 GB of video memory, CUDA 12 or newer, and about 17 GB of free disk space for the model files. Installation is either through the ComfyUI Manager, by searching for the node name, or manual, by cloning the repository into the custom_nodes folder and running pip install. The first time you generate audio, the node automatically downloads the model weights from HuggingFace: a transformer file, an audio components file, and a 4-bit version of a Gemma 3 text encoder.
The prompt format is unusual. Anything inside quotes is spoken aloud by the model, including phonetic laughs like Hahaha. Anything outside quotes is treated as stage direction that shapes how the next line is delivered, such as She sighs deeply or His voice rises with fury.
The README warns that the first generation takes several minutes while models load into memory, but later runs on an H100 GPU take only about two or three seconds.
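The prompt convention described above (quoted text is spoken, unquoted text is stage direction) can be illustrated with a short Python sketch. This is not code from ComfyUI-DramaBox, and the model may interpret prompts differently; the helper name split_prompt and the example script are hypothetical, and the sketch assumes straight double quotes only.

```python
import re
from typing import List, Tuple

# Hypothetical helper, not part of ComfyUI-DramaBox. It only illustrates the
# quoted-versus-unquoted convention; the real model reads the raw prompt.
_QUOTED = re.compile(r'"([^"]*)"')


def split_prompt(prompt: str) -> List[Tuple[str, str]]:
    """Split a prompt into (kind, text) pairs.

    kind is "speech" for text inside straight double quotes and
    "direction" for everything outside them.
    """
    segments: List[Tuple[str, str]] = []
    cursor = 0
    for match in _QUOTED.finditer(prompt):
        # Text between the previous quote and this one is stage direction.
        direction = prompt[cursor:match.start()].strip()
        if direction:
            segments.append(("direction", direction))
        # Text inside the quotes is what would be spoken aloud.
        speech = match.group(1).strip()
        if speech:
            segments.append(("speech", speech))
        cursor = match.end()
    # Anything after the last closing quote is trailing stage direction.
    tail = prompt[cursor:].strip()
    if tail:
        segments.append(("direction", tail))
    return segments


if __name__ == "__main__":
    example = (
        'She sighs deeply. "I told you not to come back." '
        'His voice rises with fury. "Hahaha, and yet here I am!"'
    )
    for kind, text in split_prompt(example):
        print(f"{kind:>9}: {text}")
```

Splitting a draft script this way can help check, before a slow generation run, that every line meant to be spoken is actually inside quotes and that each stage direction sits directly before the line it should shape.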
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.