iamdinhthuan/vizipvoice

★ 12 | Python | Audience · researcher | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((ViZipVoice))
    What it does
      Vietnamese TTS
      Zero-shot voice cloning
      Audio generation
    Training data
      6500h Vietnamese audio
      500h English audio
      24 kHz output
    Interfaces
      Command line tool
      Gradio web UI
      Python wrapper class
    Setup
      Hugging Face weights
      30 sample speakers
      pip install


Things people build with this

USE CASE 1

Generate Vietnamese voiceovers in a specific person's voice using a short reference audio clip.

USE CASE 2

Build a Vietnamese audiobook reader by passing chapter text to the Python wrapper class.

USE CASE 3

Use the Gradio web interface to demo Vietnamese TTS with one of the 30 included reference speakers.

Tech stack

Python · Gradio · Hugging Face

Getting it running

Difficulty · moderate | Time to first run · 30 min

Install from source via pip. Model weights download automatically from Hugging Face on first run.

In plain English

ViZipVoice is a Vietnamese text-to-speech model built by fine-tuning ZipVoice, an existing open-source speech synthesis system. Its main capability is zero-shot voice cloning: you give it a short audio recording of a speaker along with a transcript of that recording, and it can then generate new Vietnamese speech in that same voice from any text you provide.

The model was trained on roughly 7,000 hours of audio: about 6,500 hours of Vietnamese and 500 hours of English. It works at a 24 kHz sample rate and uses a character-level tokenizer with 244 tokens covering Vietnamese characters (including all accented forms), digits, and punctuation. Rather than converting text to phonemes first, the system maps characters directly. A Vietnamese text normalization step runs automatically before synthesis to convert numbers, dates, abbreviations, and units into spoken form.

The model weights are hosted on Hugging Face and download automatically when you run the tool. The Hugging Face repository includes 30 sample reference audio files, each paired with a transcript file, which you can use directly as voice prompts without recording your own sample. Demo outputs generated with the model are also included.

You can use ViZipVoice through three interfaces. The command-line tool takes a prompt audio file, its transcript, and the target text, then writes an output WAV file. A Gradio web interface lets you select one of the included reference speakers and type text to synthesize in a browser. A Python wrapper class is available for integrating the model into your own code.

The README covers installation from source using pip, quality tips for prompt audio (clean recording, correct transcript, one speaker, minimal background noise), and parameters for controlling synthesis speed, number of diffusion steps, and audio postprocessing such as crossfade and silence between segments.
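To make the postprocessing idea concrete, here is a minimal sketch of what "crossfade and silence between segments" can mean when long text is synthesized in pieces and the pieces are joined. This is not ViZipVoice's implementation. The function name `join_segments` and the parameters `crossfade_ms` and `silence_ms` are hypothetical, chosen for illustration only, and the linear fade is one common choice among several. The only value taken from the description above is the 24 kHz sample rate.

```python
from typing import List, Sequence

SAMPLE_RATE = 24_000  # 24 kHz, the output rate stated in the description


def join_segments(
    segments: Sequence[Sequence[float]],
    crossfade_ms: float = 0.0,   # hypothetical parameter name
    silence_ms: float = 0.0,     # hypothetical parameter name
    sample_rate: int = SAMPLE_RATE,
) -> List[float]:
    """Concatenate mono audio segments given as sequences of float samples.

    If silence_ms > 0, that much silence is inserted between neighbours.
    Otherwise, if crossfade_ms > 0, neighbours overlap with a linear crossfade.
    """
    out: List[float] = []
    gap = int(sample_rate * silence_ms / 1000)     # silence length in samples
    fade = int(sample_rate * crossfade_ms / 1000)  # overlap length in samples

    for i, seg in enumerate(segments):
        seg = list(seg)
        if i == 0:
            out.extend(seg)
            continue
        if gap > 0:
            # Hard join with a pause: no overlap needed.
            out.extend([0.0] * gap)
            out.extend(seg)
            continue
        # Overlap at most `fade` samples, limited by what is available.
        n = min(fade, len(out), len(seg))
        start = len(out) - n
        for k in range(n):
            w = (k + 1) / (n + 1)  # weight of the incoming segment
            out[start + k] = out[start + k] * (1.0 - w) + seg[k] * w
        out.extend(seg[n:])
    return out
```

With a crossfade, the joined audio is shorter than the sum of its parts by the overlap length at each join. With silence, it is longer by the gap length at each join. Check the README for the actual parameter names and defaults before relying on this behaviour.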

Copy-paste prompts

Prompt 1

Using ViZipVoice, generate a Vietnamese audio clip of this text in the voice of included sample speaker number 5.

Prompt 2

Write a Python script using the ViZipVoice wrapper class to convert a list of Vietnamese sentences into WAV files with a consistent voice and speed setting.

Prompt 3

Set up the ViZipVoice Gradio interface locally and test synthesis quality on a short Vietnamese paragraph with three different reference speakers.

Verify against the repo before relying on details.