jasonppy/voicecraft

★ 8,484 · Jupyter Notebook · Audience · researcher · Complexity · 4/5 · Setup · hard

Mindmap

mindmap
  root((voicecraft))
    What it does
      Voice cloning
      Speech editing
      Text to speech
    Input sources
      Audiobooks
      Podcasts
      YouTube audio
    Setup options
      Google Colab
      Docker container
      Local GPU install
    Model sizes
      330M parameters
      830M parameters


Things people build with this

USE CASE 1

Clone a speaker's voice from a podcast clip and generate new sentences that sound like that person

USE CASE 2

Edit an audiobook recording to fix a mispronounced word without re-recording the whole passage

USE CASE 3

Build a voice-over tool that generates spoken audio in a custom voice from a written text script

Tech stack

Python · PyTorch · CUDA · Jupyter Notebook · Docker · Gradio

Getting it running

Difficulty · hard · Time to first run · 1 day+

Requires a CUDA-capable NVIDIA GPU, conda, Montreal Forced Aligner, and several audio processing libraries. Google Colab is the easiest no-install option.
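Before starting a local install, it can help to confirm which of the prerequisites listed above are already present. The following is a minimal sketch, not part of the VoiceCraft repository: `check_prerequisites` is a hypothetical helper, and it assumes that conda and Montreal Forced Aligner are exposed on the PATH as `conda` and `mfa`, and that PyTorch (if installed) is importable as `torch`.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Return a dict mapping each prerequisite name to True (found) or False."""
    report = {
        # Assumption: conda is installed so that `conda` is on the PATH.
        "conda": shutil.which("conda") is not None,
        # Assumption: Montreal Forced Aligner provides an `mfa` executable.
        "mfa": shutil.which("mfa") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_gpu": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            # True only if PyTorch can see a CUDA-capable NVIDIA GPU.
            report["cuda_gpu"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is treated the same as no usable GPU.
            report["cuda_gpu"] = False
    return report


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name:16s} {'found' if found else 'missing'}")
```

If `cuda_gpu` comes back as missing, the Google Colab route is likely the quicker way to a first run.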

In plain English

VoiceCraft is an AI system that can edit existing speech recordings or generate new speech from text, using only a short sample of a person's voice as a reference. If you give it a few seconds of audio, it can produce new spoken words that sound like the same person, or it can modify what was already said in a recording. The project describes this as working on real-world audio sources like audiobooks, YouTube videos, and podcasts, not just controlled studio recordings.

The underlying approach is a type of AI model that works by predicting missing pieces of audio, similar to how some text AI models fill in blanks within a sentence. Two model sizes are available on HuggingFace: a 330 million parameter version and a larger 830 million parameter version, with enhanced variants of both released in April 2024.

There are several ways to try it. The easiest is a Google Colab notebook that runs in a browser without any local installation. A Docker-based option is also available for those comfortable with containers. For local installation, setup requires Python, conda, and a CUDA-capable NVIDIA graphics card. The setup process installs a number of audio processing libraries and a forced-alignment tool called Montreal Forced Aligner, which helps the model match text to the timing of audio. A Gradio web interface can be run locally or accessed through HuggingFace Spaces.

The repository includes Jupyter notebooks for both text-to-speech inference and speech editing, plus command-line scripts for integrating the model into other projects. Training and finetuning instructions are also included for those who want to adapt the model to different voices or datasets.

This is a research project backed by a published academic paper. It is primarily aimed at researchers and developers working in audio, though the Colab and HuggingFace demos make it accessible to anyone curious about AI voice generation.
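As a rough guide to what the two model sizes mean for GPU memory, the sketch below estimates the space taken by the weights alone. It is back-of-the-envelope arithmetic, not a figure from the project: the function name `weight_memory_gb` is hypothetical, the 4-byte (32-bit) and 2-byte (16-bit) precisions are assumptions, and real usage will be higher once activations, the audio codec, and framework overhead are included.

```python
def weight_memory_gb(num_params, bytes_per_param=4):
    """Estimate memory for model weights only, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as stated in the description above.
MODEL_SIZES = {"330M": 330_000_000, "830M": 830_000_000}

if __name__ == "__main__":
    for name, num_params in MODEL_SIZES.items():
        fp32 = weight_memory_gb(num_params, bytes_per_param=4)
        fp16 = weight_memory_gb(num_params, bytes_per_param=2)
        print(f"{name}: about {fp32:.2f} GB at 32-bit, {fp16:.2f} GB at 16-bit")
```

Under these assumptions the 330M model's weights come to about 1.32 GB at 32-bit precision and the 830M model's to about 3.32 GB, so the weights themselves are unlikely to be the limiting factor on most recent NVIDIA cards.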

Copy-paste prompts

Prompt 1

Using VoiceCraft, how do I generate a new sentence in someone's voice given a 5-second reference audio clip?

Prompt 2

How do I run VoiceCraft locally on my NVIDIA GPU to edit an existing speech recording and change one word?

Prompt 3

Set up VoiceCraft with Docker and show me how to use the Gradio web interface to clone a voice from an audio file

Prompt 4

Write a Python script using VoiceCraft to convert a paragraph of text into speech using a reference voice sample I provide


Verify against the repo before relying on details.