jianchang512/clone-voice

★ 8,950 | Python | Audience · general | Complexity · 3/5 | License | Setup · hard

Mindmap

mindmap
  root((clone-voice))
    What it does
      Text to cloned speech
      Audio re-dubbing
      16 languages
    Tech stack
      Python
      xtts_v2 model
      CUDA optional
    Setup paths
      Windows precompiled
      Linux source install
      Hugging Face models
    Audience
      Content creators
      No-code users


Things people build with this

USE CASE 1

Clone someone's voice from a short recording and generate new spoken text in that voice.

USE CASE 2

Re-dub an existing audio clip so it plays back in a different person's voice.

USE CASE 3

Produce speech in a cloned voice across 16 supported languages including English, Chinese, and Japanese.

Tech stack

Python · xtts_v2 · Hugging Face · CUDA

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires downloading about 3 GB of model files from Hugging Face. Users in China need a working proxy to reach Hugging Face.

The underlying xtts_v2 model is licensed for personal learning and research only. Commercial use is not permitted.
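Because the model download is large and can be interrupted by proxy failures, a quick size check on the model folder can help confirm that the files arrived. The following is a minimal sketch, not part of clone-voice itself: the function names, the 2 GiB threshold, and the placeholder path are all assumptions made for illustration, and the real folder location should be taken from the project's README.

```python
from pathlib import Path

# Hypothetical threshold: the page says the models total roughly 3 GB,
# so anything well below that suggests an incomplete download.
EXPECTED_MIN_BYTES = 2 * 1024**3


def folder_size_bytes(folder):
    """Return the total size of all files under `folder`, or 0 if it is missing."""
    root = Path(folder)
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def looks_complete(folder, minimum=EXPECTED_MIN_BYTES):
    """Rough heuristic: the folder holds at least `minimum` bytes of files."""
    return folder_size_bytes(folder) >= minimum


if __name__ == "__main__":
    # Placeholder path; replace it with the model folder named in the README.
    model_dir = "path/to/model/folder"
    size_gib = folder_size_bytes(model_dir) / 1024**3
    status = "looks complete" if looks_complete(model_dir) else "may be incomplete"
    print(f"{model_dir}: {size_gib:.2f} GiB, {status}")
```

A size check only catches missing or truncated files. It does not verify that the files are the right ones or that they are uncorrupted.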

In plain English

Clone Voice is a voice cloning tool with a browser-based interface. It takes a short audio recording of a person's voice and uses it to generate new speech. You can either type text and have it spoken in the cloned voice, or take an existing audio clip and have it re-rendered in that voice. The README is written primarily in Chinese, with an English version linked separately.

The tool is built on a speech synthesis model called xtts_v2, developed by coqui.ai, which is licensed for personal learning and research only, not for commercial use. It supports 16 languages, including Chinese, English, Japanese, Korean, French, German, and Italian. The README notes that English output quality is good and Chinese quality is acceptable.

For Windows users, a precompiled version is available as a downloadable package. You double-click an executable file, wait for a web page to open automatically, and then use the interface by clicking through the options. The model files, which total roughly 3 GB, need to be downloaded and placed in a specific folder. No coding is required for the precompiled path.

For users on Linux or macOS, or those who want to run from source, the process involves Python 3.9 through 3.11, setting up a virtual environment, installing dependencies, and downloading the model files from Hugging Face. Users in China need a working proxy connection for the download because Hugging Face is blocked there. The README includes detailed troubleshooting notes for proxy-related failures, which it identifies as the most common source of errors.

If the machine has an Nvidia GPU, CUDA acceleration can be enabled for faster processing. The same developer also maintains related tools for video translation with dubbing, speech-to-text transcription, and vocal separation from background music.
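For the source install, two things are worth checking before installing dependencies: that the interpreter falls in the supported Python 3.9 to 3.11 range, and whether a CUDA-capable GPU is visible. The sketch below is an illustration, not code from the repository. The function names are hypothetical, and the GPU check assumes PyTorch is the backend and is already installed; it reports no CUDA if PyTorch is missing.

```python
import sys

# Supported range as stated on this page: Python 3.9 through 3.11.
MIN_VERSION = (3, 9)
MAX_VERSION = (3, 11)


def python_version_supported(version_info=None):
    """Return True if the (major, minor) version lies within the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


def cuda_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch  # assumption: PyTorch is the backend used for inference
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_version_supported():
        print(f"Python {current} is within the supported range.")
    else:
        print(f"Python {current} is outside 3.9 to 3.11; use a different interpreter.")
    print("CUDA available:", cuda_available())
```

Running this inside the project's virtual environment is the more useful test, since the system interpreter may differ from the one the environment uses.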

Copy-paste prompts

Prompt 1

I have a 10-second audio clip of my voice. Help me set up clone-voice to generate a new audio file where it says 'Welcome to my channel' in my voice.

Prompt 2

I'm running clone-voice on Windows and the model files aren't being found. Walk me through placing the 3GB xtts_v2 model files in the correct folder.

Prompt 3

Help me set up clone-voice on macOS using Python 3.10, including creating the virtual environment and downloading model files from Hugging Face.

Prompt 4

I want to re-dub an English audio clip into Japanese using clone-voice. What settings do I need and what output quality can I expect?


Verify against the repo before relying on details.