explaingit

microsoft/vibevoice

Analysis updated 2026-06-20

46,676 stars | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

TLDR

VibeVoice is a set of open AI models from Microsoft that can transcribe up to 60 minutes of audio in one pass, including who said what and when, and generate natural-sounding speech, all available to download and run yourself.

Mindmap

mindmap
  root((vibevoice))
    What it does
      Speech to text
      Text to speech
      Speaker identification
    ASR features
      60 min single pass
      Speaker diarization
      Timestamps
      Custom vocabulary
    Tech stack
      Python
      Hugging Face
      vLLM inference
    Who it helps
      Researchers
      Developers
      Transcription services
    Limitations
      TTS removed 2025
      GPU recommended

What do people build with it?

USE CASE 1

Transcribe a full hour-long meeting recording with speaker labels and timestamps in a single pass

USE CASE 2

Build a podcast transcription service that identifies different speakers automatically

USE CASE 3

Fine-tune the speech recognition model on domain-specific audio with custom vocabulary

USE CASE 4

Stream real-time text-to-speech output for an application using the remaining TTS variant

What is it built with?

Python, Hugging Face Transformers, vLLM

How does it compare?

Repo              microsoft/vibevoice  oobabooga/textgen  d4vinci/scrapling
Stars             46,676               46,945             46,073
Language          Python               Python             Python
Setup difficulty  hard                 moderate           moderate
Complexity        4/5                  3/5                3/5
Audience          researcher           developer          developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1h+

A GPU is recommended for efficient inference. The model weights are large and are hosted on Hugging Face.

In plain English

VibeVoice is a family of open-source AI models from Microsoft for voice-related tasks, covering both speech-to-text (converting spoken audio to written transcripts) and text-to-speech (converting written text into natural-sounding spoken audio). These are the same fundamental tasks behind products like transcription services, voice assistants, and audiobook generators, but released as open model weights that anyone can download, run, or fine-tune.

The project contains two main components. The ASR (Automatic Speech Recognition) model, called VibeVoice-ASR, is designed to transcribe up to 60 minutes of continuous audio in a single pass rather than breaking it into short chunks as most transcription tools do. It outputs structured transcripts that include not just what was said but who said it (speaker labeling, also called diarization) and when (timestamps), with support for custom vocabulary terms such as technical jargon or proper names.

The TTS (Text-to-Speech) model could generate up to 90 minutes of speech with multiple distinct speakers in a single pass, but its code was removed from the repository in 2025 after it was found to be misused in ways inconsistent with responsible AI principles. A streaming real-time variant remains available. Both components use a novel architecture involving continuous speech tokenizers operating at a low frame rate to handle long audio efficiently.

You would use VibeVoice-ASR when you need to transcribe long interviews, meetings, podcasts, or lectures with speaker attribution, especially if you want fine-grained control over the model through fine-tuning on domain-specific audio. It is available via the Hugging Face Transformers library and integrates with vLLM for fast inference. The code is written in Python and the model weights are hosted on Hugging Face.
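
As a rough illustration of what a transcript with speakers and timestamps might look like once it is loaded into Python, the sketch below uses a hypothetical `Segment` record and renders it as readable lines. The field names (`speaker`, `start`, `end`, `text`) and the line layout are assumptions made for illustration; they are not VibeVoice-ASR's documented output schema, so check the repo for the actual format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One transcript segment. Hypothetical schema, not VibeVoice's own."""
    speaker: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def fmt_time(seconds: float) -> str:
    """Render seconds as HH:MM:SS, truncating fractions of a second."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: List[Segment]) -> str:
    """Format segments as '[start - end] speaker: text' lines in time order."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{fmt_time(s.start)} - {fmt_time(s.end)}] {s.speaker}: {s.text}"
        for s in ordered
    )
```

Keeping the segments as plain records like this makes it straightforward to filter by speaker or export to subtitle formats later, whatever the model's native output turns out to be.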

Copy-paste prompts

Prompt 1
Using microsoft/vibevoice ASR via Hugging Face Transformers, write a Python script that loads the model and transcribes a local audio file, printing the transcript with speaker labels and timestamps.
Prompt 2
I want to run microsoft/vibevoice-ASR on a GPU server using vLLM for fast inference. Show me the setup commands and a minimal Python inference example.
Prompt 3
Fine-tune microsoft/vibevoice-ASR on my own audio dataset with domain-specific vocabulary terms. Give me the training script outline and the data format expected.
Prompt 4
How do I add custom vocabulary terms to microsoft/vibevoice-ASR to improve accuracy on technical jargon? Show me the configuration or prompting approach.
Prompt 5
Set up a batch transcription pipeline using microsoft/vibevoice-ASR that processes multiple audio files in a folder and writes each transcript to a matching text file.
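
Prompt 5 describes a folder-to-text-files pipeline. Independent of how the model is actually invoked, the file handling could be sketched as follows. The `transcribe` argument is a hypothetical callable (audio path in, transcript text out) that you would supply yourself, for example a wrapper around whatever inference code the repo documents. Nothing here calls VibeVoice, and the extension list is an assumption.

```python
from pathlib import Path
from typing import Callable, List

# Assumed set of audio extensions; adjust to what your inference code accepts.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a"}


def batch_transcribe(
    folder: Path,
    transcribe: Callable[[Path], str],
    out_dir: Path,
) -> List[Path]:
    """Run `transcribe` on each audio file in `folder` and write <stem>.txt files.

    Files are processed in sorted order. A failure on one file is reported
    and skipped so the rest of the batch still runs.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for audio in sorted(folder.iterdir()):
        if not audio.is_file() or audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        try:
            text = transcribe(audio)
        except Exception as exc:  # one bad file should not stop the batch
            print(f"skipped {audio.name}: {exc}")
            continue
        target = out_dir / f"{audio.stem}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Passing the transcriber in as a function keeps the loop testable with a stub and lets you swap a Transformers-based call for a vLLM-backed one without touching the file handling.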

Frequently asked questions

What is vibevoice?

VibeVoice is a set of open AI models from Microsoft that can transcribe up to 60 minutes of audio in one pass, including who said what and when, and generate natural-sounding speech, all available to download and run yourself.

What language is vibevoice written in?

Mainly Python. The stack also includes Hugging Face Transformers and vLLM.

How hard is vibevoice to set up?

Setup difficulty is rated hard, with roughly 1h+ to a first successful run.

Who is vibevoice for?

Mainly researchers.

Verify against the repo before relying on details.