explaingit

innnky/so-vits-svc

3,786 | Python | Audience · general | Complexity · 4/5 | Setup · hard

TLDR

A Python tool that uses AI to convert a singing recording so it sounds like a different person's voice while preserving the original melody and pitch. You train it on audio samples of the target voice.

Mindmap

mindmap
  root((so-vits-svc))
    What it does
      Singing voice conversion
      Melody preservation
      Pitch shifting
    AI components
      SoftVC feature extraction
      VITS synthesis model
      Pitch input channel
    Workflow
      Audio preprocessing
      Speaker model training
      Inference output
    Setup options
      Colab one-click notebook
      Local GPU training
      Hugging Face weights

Things people build with this

USE CASE 1

Convert your own vocal recordings to sound like a trained target voice without altering the melody or pitch.

USE CASE 2

Train a custom voice model from a collection of audio recordings of a specific singer.

USE CASE 3

Generate synthetic vocal content for a music project using a one-click Colab notebook without a local GPU.

USE CASE 4

Experiment with pitch-shifted voice conversion by adjusting the semitone offset at inference time.
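
A semitone offset corresponds to a fixed frequency ratio in 12-tone equal temperament: shifting by n semitones multiplies every frequency by 2^(n/12). The sketch below shows only that arithmetic. The function names are hypothetical and are not part of so-vits-svc, and treating 0.0 as "unvoiced" is an assumption made for this illustration.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift of `semitones` (12-tone equal temperament)."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz, semitones):
    """Scale each pitch value in Hz; 0.0 is assumed to mean 'unvoiced' and is kept."""
    ratio = semitone_ratio(semitones)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


if __name__ == "__main__":
    # +12 semitones is one octave up, so A4 (440 Hz) becomes A5 (880 Hz).
    print(shift_f0([440.0, 0.0], 12))
```

A shift of +3 semitones raises the pitch by a factor of about 1.19, and a negative offset lowers it by the inverse ratio.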

Tech stack

Python, PyTorch, VITS, SoftVC, Hugging Face

Getting it running

Difficulty · hard | Time to first run · 1h+

48 kHz inference requires significant GPU memory; use the 32 kHz branch on smaller GPUs. Start from the Hugging Face pre-trained weights to avoid convergence failures.

In plain English

so-vits-svc is a Python project for singing voice conversion: it takes a recording of someone singing and converts it to sound like a different person's voice. It uses two AI components: SoftVC, which extracts the vocal content of the source audio, and VITS, a speech synthesis model repurposed here to produce the converted voice. Pitch is fed into the model as a separate input so the melody is preserved during the transfer.

To train a model for a specific target voice, you collect audio recordings of that person and organize them in folders by speaker name. Preprocessing resamples the audio to 48 kHz, splits it into training and validation sets, and extracts pitch features and vocal content representations. Training has no fixed stopping point: you run it until you are satisfied with the output quality.

For inference, you place an audio file in the raw folder, specify the target voice name and the number of semitones to shift the pitch, and run the inference script to get the converted output.

The project includes a one-click Google Colab notebook for users who want to prepare data and train without setting up a local Python environment. Pre-trained base model weights are available for download from Hugging Face, and the README recommends starting from these rather than training from scratch, because a model trained from zero may fail to converge.

The README, written primarily in Chinese, includes usage rules: users are responsible for ensuring they have the rights to the audio used for training, and videos made with the tool must clearly credit the original audio source. The 48 kHz inference mode requires significant GPU memory, and the README suggests switching to the 32 kHz branch if memory runs out. Training on more than ten speakers may degrade voice quality.
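
The per-speaker folder layout and the training/validation split described above can be sketched in a few lines. This is a minimal illustration and not the repo's preprocessing script. The function name, the `dataset_root/<speaker_name>/*.wav` layout, and the number of clips held out per speaker are all assumptions made for the example.

```python
import random
from pathlib import Path


def split_by_speaker(dataset_root, val_per_speaker=2, seed=0):
    """Return (train, val) lists of (speaker_name, wav_path) pairs.

    Assumes one sub-folder per speaker under dataset_root, each holding
    .wav clips. The hold-out count is illustrative, not the repo's value.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    speaker_dirs = sorted(p for p in Path(dataset_root).iterdir() if p.is_dir())
    for speaker_dir in speaker_dirs:
        wavs = sorted(speaker_dir.glob("*.wav"))
        rng.shuffle(wavs)
        # Hold out a few clips, but always leave at least one for training.
        n_val = min(val_per_speaker, max(len(wavs) - 1, 0))
        val += [(speaker_dir.name, w) for w in wavs[:n_val]]
        train += [(speaker_dir.name, w) for w in wavs[n_val:]]
    return train, val
```

Splitting inside each speaker folder, rather than across the whole dataset, keeps every speaker represented in both sets, which matters when several voices are trained together.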

Copy-paste prompts

Prompt 1
I have 30 minutes of singing recordings. Walk me through the so-vits-svc preprocessing, training, and inference steps to convert a new song to sound like that voice.
Prompt 2
How do I use so-vits-svc in Google Colab with the one-click notebook? What do I need to upload and what output files does the notebook produce?
Prompt 3
I downloaded a so-vits-svc pre-trained model from Hugging Face. How do I run inference on a WAV file, shift the pitch up 3 semitones, and save the result?
Prompt 4
My so-vits-svc inference is running out of GPU memory at 48 kHz. How do I switch to the 32 kHz branch and what quality difference should I expect?

Verify against the repo before relying on details.