explaingit

saganaki22/higgs_v3-tts-comfyui

21 · Python · Audience · vibe coder · Complexity · 3/5 · License · Setup · moderate

TLDR

A ComfyUI plugin that adds text-to-speech blocks powered by Higgs Audio v3, supporting 100 languages, zero-shot voice cloning, emotion tags, multi-speaker dialogues, and long-text splitting, all running locally on your GPU.

Mindmap

mindmap
  root((higgs_v3-tts-comfyui))
    What it does
      Text to spoken audio
      Zero-shot voice cloning
      Multi-speaker dialogue
    Features
      100 language support
      Emotion and pause tags
      Long text auto-chunking
    Hardware
      CUDA GPU preferred
      About 11GB VRAM on GPU
      CPU fallback slow
    Install
      Clone to custom nodes
      Run install script
      Model auto-downloads

Things people build with this

USE CASE 1

Convert a long article or script into natural spoken audio using a cloned voice from a short reference recording.

USE CASE 2

Build a multi-speaker audio dialogue inside ComfyUI where each character has a distinct reference voice.

USE CASE 3

Add emotion tags and sound effects inline in text to generate expressive narration for videos or games.
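For the multi-speaker use case, it can help to check a script before sending it to the plugin. The sketch below is a hypothetical pre-flight check, not the plugin's own code. It assumes a dialogue is a list of (speaker, text) lines plus a mapping from speaker name to a reference audio path; the names `DialogueLine`, `validate_dialogue`, and `MAX_SPEAKERS` are invented for illustration. The limit of six speakers is the figure given in this page's summary, and the text format the plugin actually expects should be taken from the repo.

```python
from dataclasses import dataclass

# Limit quoted in this page's summary; check the repo before relying on it.
MAX_SPEAKERS = 6


@dataclass(frozen=True)
class DialogueLine:
    """One spoken line. Hypothetical structure, not the plugin's format."""
    speaker: str
    text: str


def validate_dialogue(lines: list[DialogueLine],
                      reference_voices: dict[str, str]) -> list[str]:
    """Return distinct speakers in order of first appearance.

    Raises ValueError if there are too many speakers or if a speaker
    has no reference voice assigned.
    """
    speakers: list[str] = []
    for line in lines:
        if line.speaker not in speakers:
            speakers.append(line.speaker)

    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers found, limit is {MAX_SPEAKERS}")

    missing = [s for s in speakers if s not in reference_voices]
    if missing:
        raise ValueError(f"no reference voice for: {', '.join(missing)}")

    return speakers


if __name__ == "__main__":
    script = [
        DialogueLine("alice", "Did you hear that?"),
        DialogueLine("bob", "Hear what?"),
        DialogueLine("alice", "Never mind."),
    ]
    # Paths are placeholders for your own reference recordings.
    voices = {"alice": "voices/alice.wav", "bob": "voices/bob.wav"}
    print(validate_dialogue(script, voices))
```

Running the check first surfaces a missing reference voice as a clear error instead of a failed generation partway through a workflow.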

Tech stack

Python · PyTorch · Transformers · ComfyUI · CUDA

Getting it running

Difficulty · moderate · Time to first run · 30 min

Needs a CUDA GPU with about 11 GB of VRAM for practical use (it can fall back to CPU, but much more slowly) and about 9.3 GB of disk space for the model checkpoint.

Research and non-commercial use only: you may not use this in commercial products, and voice cloning requires the consent of the person whose voice is used.
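A quick way to see whether a machine meets the hardware figure above is to compare the GPU's total memory against it before loading anything. This is a minimal sketch under stated assumptions, not the plugin's own device logic: `choose_device` and `detect_device` are hypothetical helper names, the 11 GB threshold is the approximate figure quoted on this page, and memory is measured in decimal gigabytes. The only PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.get_device_properties()`.

```python
from typing import Optional

# Approximate VRAM figure quoted on this page (decimal GB); an assumption,
# not a limit enforced by the plugin.
REQUIRED_VRAM_GB = 11.0


def choose_device(cuda_available: bool,
                  total_vram_gb: Optional[float]) -> str:
    """Return 'cuda' if a GPU with enough memory is present, else 'cpu'."""
    if (cuda_available
            and total_vram_gb is not None
            and total_vram_gb >= REQUIRED_VRAM_GB):
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch if it is installed; otherwise report 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if not torch.cuda.is_available():
        return "cpu"
    # total_memory is reported in bytes for GPU index 0.
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return choose_device(True, total_bytes / 1e9)


if __name__ == "__main__":
    print(f"Suggested device: {detect_device()}")
```

Keeping the decision in a pure function (`choose_device`) makes the rule easy to test without a GPU. Total memory is only an upper bound, since other models loaded in ComfyUI also take up VRAM.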

In plain English

This is a plugin for ComfyUI that adds text-to-speech using Higgs Audio v3, a model from Boson AI. ComfyUI is a visual tool where you connect blocks together to build AI workflows; it is mostly used for image generation but can be extended to audio. This plugin adds a set of those blocks for turning text into spoken audio.

The underlying model supports speech in roughly 100 languages, can copy a voice from a short audio sample you provide (zero-shot voice cloning), and lets you embed special tags in your text to control tone, pauses, and emotion, or to insert sound effects. For long text, the plugin splits the input into chunks at natural boundaries rather than cutting mid-sentence, then stitches the audio back together. You can also run multi-speaker dialogues in which each speaker uses a different reference voice, with up to six speakers in one session.

The model is a 4-billion-parameter checkpoint that takes about 9.3 GB on disk and needs approximately 11 GB of video memory (VRAM) to run. The plugin tracks the model's memory use within ComfyUI so it fits alongside other loaded models without manual management. It runs on CUDA-capable graphics cards and falls back to CPU if needed, though CPU inference is significantly slower.

Installation involves cloning the repository into ComfyUI's custom nodes folder and running an install script. The installer does not touch your existing PyTorch or Transformers setup. The model checkpoint can be downloaded automatically on first use or placed manually in a specific folder. The plugin is tested with Transformers versions 5.3.0 through 5.5.0.

Boson AI released this model for research and non-commercial use. The README explicitly notes that voice cloning should not be used without the consent of the person whose voice is being cloned. There is no paid tier or API involved; everything runs locally on your machine.
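The long-text splitting mentioned in the summary (chunks cut at natural boundaries, never mid-sentence) can be illustrated with a small sketch. This is not the plugin's implementation: the function name `split_into_chunks`, the `max_chars` budget, and the choice of sentence-ending punctuation as the only boundary are all assumptions made for the example. It covers only the text side; stitching the generated audio back together is left out.

```python
import re


def split_into_chunks(text: str, max_chars: int = 400) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole, so no chunk
    ever ends in the middle of a sentence.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [p.strip() for p in parts if p.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "One. Two! Three? Four."
    for i, chunk in enumerate(split_into_chunks(sample, max_chars=10), 1):
        print(i, chunk)
```

A real splitter would also need to handle abbreviations, inline tags, and languages that do not use these punctuation marks, which is one reason to rely on the plugin's own chunking rather than pre-splitting text by hand.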

Copy-paste prompts

Prompt 1
I want to clone a voice from a 10-second audio clip and generate a 500-word narration in ComfyUI using higgs_v3-tts-comfyui. Walk me through the setup and the node connections I need.
Prompt 2
Show me how to write a multi-speaker dialogue script for higgs_v3-tts-comfyui with three speakers, each using a different reference voice, and explain the text format it expects.
Prompt 3
List the emotion tags and sound effect tags that higgs_v3-tts-comfyui supports, and give me an example paragraph that uses pauses, a happy tone, and a door-knock sound effect.

Verify against the repo before relying on details.