Analysis updated 2026-06-20
Transcribe a full hour-long meeting recording with speaker labels and timestamps in a single pass
Build a podcast transcription service that identifies different speakers automatically
Fine-tune the speech recognition model on domain-specific audio with custom vocabulary
Stream real-time text-to-speech output for an application using the remaining TTS variant
| | microsoft/vibevoice | oobabooga/textgen | d4vinci/scrapling |
|---|---|---|---|
| Stars | 46,676 | 46,945 | 46,073 |
| Language | Python | Python | Python |
| Setup difficulty | hard | moderate | moderate |
| Complexity | 4/5 | 3/5 | 3/5 |
| Audience | researcher | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
A GPU is recommended for efficient inference; the model weights are large and hosted on Hugging Face.
VibeVoice is a family of open-source AI models from Microsoft for voice-related tasks, covering both speech-to-text (converting spoken audio to written transcripts) and text-to-speech (converting written text into natural-sounding spoken audio). These are the same fundamental tasks behind products like transcription services, voice assistants, and audiobook generators, but released as open model weights that anyone can download, run, or fine-tune.

The project contains two main components. The ASR (Automatic Speech Recognition) model, called VibeVoice-ASR, is designed to transcribe up to 60 minutes of continuous audio in a single pass rather than breaking it into short chunks as most transcription tools do. It outputs structured transcripts that record not just what was said but who said it (speaker identification, also called diarization) and when (timestamps), with support for custom vocabulary such as technical jargon or proper names.

The TTS (Text-to-Speech) model was capable of generating up to 90 minutes of speech with multiple distinct speakers in a single pass. Its code was removed from the repository in 2025 after it was found to be misused in ways inconsistent with responsible AI principles; a streaming real-time variant remains available. Both components use a novel architecture built on continuous speech tokenizers that operate at a low frame rate to handle long audio efficiently.

You would use VibeVoice-ASR when you need to transcribe long interviews, meetings, podcasts, or lectures with speaker attribution, especially if you want fine-grained control over the model through fine-tuning on domain-specific audio. It is available via the Hugging Face Transformers library and integrates with vLLM for fast inference. The code is written in Python.
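To make the idea of a structured transcript concrete, here is a minimal Python sketch of how speaker-labelled, timestamped segments might be handled once a model has produced them. The `Segment` schema, the field names, and the sample data are assumptions made for illustration; they are not VibeVoice-ASR's actual output format, and no model is called here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized span of speech (hypothetical schema, not VibeVoice's real output)."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def fmt_time(seconds: float) -> str:
    """Render a second offset as HH:MM:SS, truncating fractions of a second."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def render(segments: list[Segment]) -> str:
    """Format segments as one '[start - end] speaker: text' line each."""
    return "\n".join(
        f"[{fmt_time(s.start)} - {fmt_time(s.end)}] {s.speaker}: {s.text}"
        for s in segments
    )


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the seconds of speech attributed to each speaker."""
    totals: dict[str, float] = defaultdict(float)
    for s in segments:
        totals[s.speaker] += s.end - s.start
    return dict(totals)


# Made-up sample data standing in for a transcript of a meeting just under an hour long.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the weekly sync."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks, let's start with the roadmap."),
    Segment("Speaker 1", 3590.0, 3598.0, "That's the hour, thanks everyone."),
]

print(render(segments))
print(talk_time(segments))
```

Keeping speaker, start, end, and text together in one record is what makes downstream uses such as per-speaker talk time or subtitle export straightforward; the real field names should be checked against the repository.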
VibeVoice is a set of open AI models from Microsoft that can transcribe up to 60 minutes of audio in one pass, including who said what and when, and generate natural-sounding speech, all available to download and run yourself.
Mainly Python. The stack also includes Hugging Face Transformers and vLLM.
Setup difficulty is rated hard, with roughly an hour or more to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.