fishaudio/bert-vits2

★ 8,745 · Python · Audience · researcher · Complexity · 4/5 · Setup · hard

Mindmap

mindmap
  root((repo))
    What it does
      Text to speech
      Multilingual output
      Natural prosody
    Architecture
      VITS2 speech model
      Multilingual BERT
      End to end synthesis
    Training
      Web preprocess UI
      Custom voice data
    Status
      No longer maintained
      Use Fish-Speech instead


Things people build with this

USE CASE 1

Generate natural-sounding speech audio from text in multiple languages using a pre-trained Bert-VITS2 model.

USE CASE 2

Prepare your own voice training dataset using the included web preprocessing interface to create a custom voice.

USE CASE 3

Study how combining BERT language embeddings with a VITS2 speech model improves prosody and naturalness.

USE CASE 4

Use as a reference implementation before migrating to the team's actively maintained Fish-Speech replacement.

Tech stack

Python · PyTorch

Getting it running

Difficulty · hard · Time to first run · 1 day+

Requires GPU hardware and training data preparation. The project is no longer maintained; Fish-Speech is the recommended replacement.

License information is not stated. Usage is restricted: it must not violate Chinese law or serve political purposes.

In plain English

Bert-VITS2 is a text-to-speech system that combines two components: VITS2, a neural network architecture for generating speech audio from text, and a multilingual BERT model, which supplies contextual language understanding to improve how the generated voice sounds. The idea is that BERT can better interpret the meaning and context of text, helping produce more natural-sounding output than VITS2 alone.

VITS is an end-to-end speech synthesis approach that generates audio directly from text input. The VITS2 variant improved on the original, and Bert-VITS2 extends it further by feeding BERT embeddings into the model so it has a richer representation of the text it is asked to speak. Because the BERT component is multilingual, the system can work across languages without a completely separate model for each one.

A web-based preprocessing interface (webui_preprocess.py) is included to help prepare training data. Beyond pointing to that script, the README is brief and gives no detailed usage instructions.

The README notes that the project is no longer actively maintained. The same team has released a newer project, Fish-Speech, which they describe as the recommended replacement; users starting fresh are advised to use it instead.

The project is written in Python. The README includes strict usage restrictions prohibiting any use that would violate Chinese law and any use for political purposes.
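The idea of feeding BERT embeddings into the speech model can be sketched in a few lines. This is a toy illustration under stated assumptions, not the repository's code: the function names (`expand_to_phonemes`, `fuse`), the per-token phoneme counts, and the plain element-wise addition are all hypothetical, and a real model would use tensors and learned projections rather than Python lists. It shows one plausible way token-level context features could be aligned with a phoneme sequence and combined with phoneme embeddings before synthesis.

```python
# Toy sketch only: names, shapes, and numbers are assumptions,
# not the Bert-VITS2 API.

def expand_to_phonemes(token_features, phonemes_per_token):
    """Repeat each token-level feature vector once per phoneme of that token."""
    if len(token_features) != len(phonemes_per_token):
        raise ValueError("need one phoneme count per token")
    expanded = []
    for vector, count in zip(token_features, phonemes_per_token):
        expanded.extend([list(vector) for _ in range(count)])
    return expanded


def fuse(phoneme_embeddings, context_features):
    """Add context features to phoneme embeddings element-wise."""
    if len(phoneme_embeddings) != len(context_features):
        raise ValueError("sequences must have the same length")
    return [
        [p + c for p, c in zip(p_vec, c_vec)]
        for p_vec, c_vec in zip(phoneme_embeddings, context_features)
    ]


# Two tokens with made-up 2-dimensional "BERT" features.
token_features = [[0.5, 1.0], [2.0, -1.0]]
# Hypothetical alignment: the first token spans 2 phonemes, the second spans 1.
phonemes_per_token = [2, 1]
# Made-up phoneme embeddings, one per phoneme (3 in total).
phoneme_embeddings = [[1.0, 1.0], [0.0, 2.0], [3.0, 3.0]]

context = expand_to_phonemes(token_features, phonemes_per_token)
encoder_input = fuse(phoneme_embeddings, context)
print(encoder_input)  # [[1.5, 2.0], [0.5, 3.0], [5.0, 2.0]]
```

Each phoneme ends up carrying both its own identity and the context of the token it belongs to, which is the intuition behind the prosody improvement claimed above. Verify the actual mechanism against the repo before relying on this sketch.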

Copy-paste prompts

Prompt 1

I want to use a pre-trained Bert-VITS2 model to convert a text passage to speech in Chinese. Show me how to load the model and run inference in Python.

Prompt 2

Walk me through using Bert-VITS2's webui_preprocess.py to prepare a set of audio recordings as training data for a custom voice model.

Prompt 3

Explain the architecture of Bert-VITS2: how does the BERT embedding feed into the VITS2 model and what does each component contribute to the final audio output?

Prompt 4

I want to migrate a Bert-VITS2 workflow to Fish-Speech. What are the key differences and how do I get started with Fish-Speech instead?


Verify against the repo before relying on details.