Convert your own vocal recordings to sound like a trained target voice without altering the melody or pitch.
Train a custom voice model from a collection of audio recordings of a specific singer.
Generate synthetic vocal content for a music project using a one-click Colab notebook without a local GPU.
Experiment with pitch-shifted voice conversion by adjusting the semitone offset at inference time.
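The semitone offset in the last use case maps to a frequency ratio through the standard equal-temperament relation, ratio = 2^(n/12). The sketch below illustrates only that arithmetic; the helper names `semitone_ratio` and `shift_f0` are hypothetical and are not taken from the so-vits-svc code, whose internal handling of the offset is not shown here.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift of `semitones` (12 semitones = one octave)."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz: float, semitones: float) -> float:
    """Shift a single pitch value in Hz by the given number of semitones.

    Hypothetical helper for illustration; unvoiced frames (f0 == 0) stay at 0.
    """
    return f0_hz * semitone_ratio(semitones)


if __name__ == "__main__":
    # Shifting A4 (440 Hz) up and down by a few offsets.
    for n in (-12, -5, 0, 7, 12):
        print(n, round(shift_f0(440.0, n), 2))
```

A positive offset raises the converted voice and a negative one lowers it, which helps when the source singer's range sits well above or below the target voice's range.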
48 kHz inference requires significant GPU memory; use the 32 kHz branch on smaller GPUs. Start from the pre-trained weights on Hugging Face to reduce the risk of training failing to converge.
so-vits-svc is a Python project for singing voice conversion: it takes a recording of someone singing and converts it to sound like a different person's voice. It uses two AI components: SoftVC, which extracts content features from the source audio, and VITS, a speech synthesis model repurposed here to produce the converted voice. Pitch is also fed into the model separately so the melody is preserved during the transfer.

To train a model for a specific target voice, you collect audio recordings of that person and organize them in folders by speaker name. Preprocessing steps resample the audio to 48 kHz, split it into training and validation sets, and extract pitch features and vocal content representations. Training then runs until you are satisfied with the quality.

During inference, you place an audio file in the raw folder, specify the target voice name and how many semitones to shift the pitch, and run the inference script to get the converted output.

The project includes a one-click Google Colab notebook for users who want to prepare data and train without setting up a local Python environment. Pre-trained base model weights are available for download from Hugging Face, and the README recommends starting from these rather than training from scratch, since training from scratch risks failing to converge.

The README, written primarily in Chinese, includes usage rules noting that users are responsible for ensuring they have rights to the audio used for training, and that videos made with the tool must clearly credit the original audio source. The 48 kHz inference mode requires significant GPU memory; the README suggests switching to a 32 kHz branch if memory runs out. Training on more than ten speakers may cause voice quality to degrade.
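The folder-per-speaker layout and the train/validation split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's preprocessing script: the function name `split_dataset`, the `root/<speaker_name>/*.wav` layout, the number of held-out clips, and the seed are all hypothetical choices made for the example.

```python
import random
from pathlib import Path


def split_dataset(root, val_per_speaker=2, seed=1234):
    """Return (train, val) lists of (speaker, wav_path) pairs.

    Assumes root/<speaker_name>/*.wav, one subfolder per speaker.
    Hypothetical helper; the real project ships its own scripts.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for speaker_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        wavs = sorted(speaker_dir.glob("*.wav"))
        rng.shuffle(wavs)
        # Hold out a few clips per speaker, but never all of them.
        n_val = min(val_per_speaker, max(len(wavs) - 1, 0))
        val += [(speaker_dir.name, w) for w in wavs[:n_val]]
        train += [(speaker_dir.name, w) for w in wavs[n_val:]]
    return train, val
```

Splitting per speaker rather than over the pooled file list keeps every voice represented in both sets, which matters when one speaker has far more recordings than another.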
Repository author: innnky.
Verify against the repo before relying on details.