Convert a long article or script into natural spoken audio using a cloned voice from a short reference recording.
Build a multi-speaker audio dialogue inside ComfyUI where each character has a distinct reference voice.
Add emotion tags and sound effects inline in text to generate expressive narration for videos or games.
Needs about 11 GB of VRAM on a CUDA GPU (CPU fallback is available but much slower) and about 9.3 GB of disk space for the model checkpoint.
This is a plugin for ComfyUI that adds text-to-speech capabilities using a model called Higgs Audio v3 from Boson AI. ComfyUI is a visual tool where you connect blocks together to build AI workflows; it is mostly used for image generation but can be extended to audio. This plugin adds a set of those blocks specifically for turning text into spoken audio.

The underlying model supports speech in roughly 100 languages, can copy a voice from a short audio sample you provide (zero-shot voice cloning), and lets you embed special tags directly in your text to control tone, pauses, and emotion, or to insert sound effects. For long text, the plugin splits the input into chunks at natural boundaries rather than cutting mid-sentence, then stitches the audio back together. You can also run multi-speaker dialogues in which each speaker uses a different reference voice, with up to six speakers in one session.

The model is a 4-billion-parameter checkpoint that takes about 9.3 GB on disk and needs approximately 11 GB of video memory (VRAM) to run. The plugin handles memory tracking for the model within ComfyUI, so it fits alongside other loaded models without manual management. It runs on CUDA-capable graphics cards and falls back to CPU if needed, though CPU inference is significantly slower.

Installation involves cloning the repository into ComfyUI's custom nodes folder and running an install script. The installer does not touch your existing PyTorch or Transformers setup. The model checkpoint can be downloaded automatically on first use or placed manually in a specific folder. The plugin is tested with Transformers versions 5.3.0 through 5.5.0.

Boson AI released this model for research and non-commercial use. The README explicitly notes that voice cloning should not be used without the consent of the person whose voice is being cloned. There is no paid tier or API involved; everything runs locally on your machine.
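The chunk-and-stitch behavior for long text described above can be illustrated with a minimal sketch. This is not the plugin's actual code: the function names `split_into_chunks` and `stitch`, the `max_chars` limit, the 24000 Hz sample rate, and the silence gap are all assumptions chosen for illustration, and audio is modeled as plain lists of float samples rather than tensors.

```python
import re


def split_into_chunks(text: str, max_chars: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper: a single sentence longer than max_chars is kept
    whole, so a chunk can exceed the limit rather than be cut mid-sentence.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [p.strip() for p in parts if p.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def stitch(
    segments: list[list[float]],
    sample_rate: int = 24000,  # assumed value, not taken from the plugin
    gap_seconds: float = 0.1,  # assumed pause between chunks
) -> list[float]:
    """Concatenate per-chunk audio, inserting a short silence between chunks."""
    gap = [0.0] * int(sample_rate * gap_seconds)
    out: list[float] = []
    for index, segment in enumerate(segments):
        if index > 0:
            out.extend(gap)
        out.extend(segment)
    return out
```

In a real workflow each chunk would be passed to the text-to-speech model and the resulting audio segments handed to the stitching step. Splitting only at sentence ends keeps each chunk's prosody self-contained, at the cost of chunks that vary in length.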
Verify against the repo before relying on details.