explaingit

microsoft/vibevoice

47,258 · Python · Audience · developer · Complexity · 3/5 · Active · License · Setup · moderate

TLDR

Open-source AI models from Microsoft for transcribing long audio (up to 60 minutes) with speaker identification and timestamps, plus text-to-speech capabilities.

Mindmap

mindmap
  root((VibeVoice))
    What it does
      Speech to text
      Text to speech
      Speaker identification
    Key features
      Long audio handling
      Custom vocabulary
      Structured output
    How to use
      Hugging Face integration
      Fine-tuning support
      vLLM inference
    Tech stack
      Python
      Transformers
      Hugging Face

Things people build with this

USE CASE 1

Transcribe long interviews, meetings, or podcasts with automatic speaker identification and timestamps.

USE CASE 2

Fine-tune the ASR model on domain-specific audio to recognize technical jargon or proper names in your industry.

USE CASE 3

Build a voice assistant or transcription service using the open model weights without relying on proprietary APIs.

Tech stack

Python · Hugging Face Transformers · vLLM · PyTorch

Getting it running

Difficulty · moderate · Time to first run · 30 min

Requires PyTorch and large model downloads from Hugging Face; GPU recommended for reasonable inference speed.
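Before downloading any weights, it can help to confirm that the prerequisites are present. The following is a minimal sketch using only the Python standard library. The package names (`torch`, `transformers`, `vllm`) are assumptions inferred from the tech stack listed on this page, not taken from the repository's install instructions, and the helper names are hypothetical.

```python
import importlib.util

# Assumed package names, inferred from the tech stack listed on this page.
REQUIRED = ("torch", "transformers")
OPTIONAL = ("vllm",)


def check_packages(names):
    """Return a dict mapping each top-level package name to whether it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def gpu_available():
    """Return True/False if PyTorch is installed, or None if it is not."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the script still runs without PyTorch

    return torch.cuda.is_available()


def report():
    """Build a list of human-readable status lines."""
    lines = []
    for name, found in check_packages(REQUIRED).items():
        lines.append(f"{name}: {'found' if found else 'MISSING (required)'}")
    for name, found in check_packages(OPTIONAL).items():
        lines.append(f"{name}: {'found' if found else 'not installed (optional)'}")
    gpu = gpu_available()
    if gpu is None:
        lines.append("GPU: unknown (PyTorch not installed)")
    else:
        lines.append(f"GPU: {'CUDA available' if gpu else 'no CUDA device, expect slow inference'}")
    return lines


if __name__ == "__main__":
    for line in report():
        print(line)
```

The check only reports whether packages can be imported. It does not verify versions, so consult the repository for the versions it actually requires.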

The models are open source and available to download and use; check the Hugging Face repository for the specific license terms on the model weights.

In plain English

VibeVoice is a family of open-source AI models from Microsoft for voice tasks, covering both speech-to-text (converting spoken audio into written transcripts) and text-to-speech (converting written text into natural-sounding speech). These are the same fundamental tasks behind transcription services, voice assistants, and audiobook generators, but released as open model weights that anyone can download, run, or fine-tune.

The project has two main components. The ASR (Automatic Speech Recognition) model, VibeVoice-ASR, is designed to transcribe up to 60 minutes of continuous audio in a single pass rather than splitting it into short chunks, as most transcription tools do. It outputs structured transcripts that record not just what was said but also who said it (speaker identification, also called diarization) and when (timestamps), and it supports custom vocabulary such as technical jargon or proper names.

The TTS (Text-to-Speech) model could generate up to 90 minutes of speech with multiple distinct speakers in a single pass. Its code was removed from the repository in 2025 after the model was found to be misused in ways inconsistent with responsible AI principles; a streaming real-time variant remains available. Both components use a novel architecture built on continuous speech tokenizers that operate at a low frame rate to handle long audio efficiently.

You would use VibeVoice-ASR when you need to transcribe long interviews, meetings, podcasts, or lectures with speaker attribution, especially if you want fine-grained control over the model through fine-tuning on domain-specific audio. It is available through the Hugging Face Transformers library and integrates with vLLM for fast inference. The code is written in Python, and the model weights are hosted on Hugging Face.
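To make the "who, when, what" idea concrete, here is a small sketch of how a diarized, timestamped transcript might be represented and summarized once you have it. The `Segment` fields, the speaker labels, and the sample data are all hypothetical illustrations. This is not VibeVoice-ASR's actual output schema, which should be checked against the repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript entry. Field names are hypothetical, not the model's schema."""
    speaker: str  # who spoke
    start: float  # when the segment starts, in seconds
    end: float    # when the segment ends, in seconds
    text: str     # what was said


def format_timestamp(seconds):
    """Render whole seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render(segments):
    """Turn segments into readable transcript lines."""
    return [
        f"[{format_timestamp(s.start)}-{format_timestamp(s.end)}] {s.speaker}: {s.text}"
        for s in segments
    ]


def talk_time(segments):
    """Total speaking time in seconds per speaker."""
    totals = defaultdict(float)
    for s in segments:
        totals[s.speaker] += s.end - s.start
    return dict(totals)


# Made-up sample data for illustration only.
SAMPLE = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the show."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks for having me."),
    Segment("Speaker 1", 9.0, 75.0, "Let's get started."),
]

if __name__ == "__main__":
    for line in render(SAMPLE):
        print(line)
    print(talk_time(SAMPLE))
```

Keeping speaker, start, end, and text as separate fields makes downstream tasks such as per-speaker statistics or subtitle export straightforward, whatever format the model itself emits.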

Copy-paste prompts

Prompt 1
How do I use VibeVoice ASR to transcribe a 45-minute podcast with speaker diarization and timestamps?
Prompt 2
Show me how to fine-tune VibeVoice on custom audio data to recognize industry-specific terminology.
Prompt 3
How do I integrate VibeVoice with vLLM for fast real-time speech-to-text inference?
Prompt 4
What's the difference between VibeVoice ASR and other open-source transcription models like Whisper?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.