explaingit

quentinfuxa/whisperlivekit

10,299 | Python | Audience · developer | Complexity · 3/5 | License · Apache 2.0 | Setup · moderate

TLDR

A self-hosted server that converts spoken audio to text in real time with very low delay, translates between roughly 200 languages, identifies who is speaking, and runs on NVIDIA GPUs, Apple Silicon, and standard CPUs.

Mindmap

mindmap
  root((WhisperLiveKit))
    What it does
      Real-time transcription
      Speaker identification
      Language translation
      Offline file mode
    Tech backends
      Whisper model
      Voxtral Mini
      CUDA and MLX
    API styles
      OpenAI-compatible REST
      Deepgram WebSocket
      Native WebSocket
    Use cases
      Live captioning
      Subtitle generation
      Private transcription


Things people build with this

USE CASE 1

Build a live captioning tool for video calls or live streams that shows words as people speak.

USE CASE 2

Create SRT subtitle files from audio or video recordings without sending data to a cloud service.

USE CASE 3

Replace OpenAI's transcription API with a local, private alternative using the same REST format.

USE CASE 4

Add multi-speaker labeling to a meeting recorder so the transcript shows who said what.
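Use cases 2 and 4 both end in a timed, optionally speaker-labeled transcript. As a rough illustration of what that output looks like, the sketch below renders segments as SRT cues. The `Segment` class and the `Speaker N:` prefix are hypothetical, introduced only for this example; they are not WhisperLiveKit's actual data types, and the project's own SRT export may format labels differently.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Segment:
    """Hypothetical transcript segment; not WhisperLiveKit's actual output type."""
    start: float                   # seconds from the start of the recording
    end: float                     # seconds from the start of the recording
    text: str
    speaker: Optional[int] = None  # None when no speaker label is available


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues, prefixing the speaker when present."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        label = f"Speaker {seg.speaker}: " if seg.speaker is not None else ""
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{label}{seg.text.strip()}\n"
        )
    # A blank line separates consecutive cues in the SRT format.
    return "\n".join(cues)


if __name__ == "__main__":
    print(to_srt([
        Segment(0.0, 1.5, "Hello there.", speaker=1),
        Segment(1.5, 4.25, "Hi, can you hear me?", speaker=2),
    ]))
```

Keeping the timestamp arithmetic in integer milliseconds avoids rounding drift between the start and end of adjacent cues.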

Tech stack

Python, WebSocket, REST API, CUDA, MLX, Whisper, Voxtral Mini

Getting it running

Difficulty · moderate | Time to first run · 30 min

A GPU is needed for real-time performance; CPU mode works but is significantly slower.

Use, modify, and distribute freely for any purpose including commercial use, as long as you include the license notice (Apache 2.0).

In plain English

WhisperLiveKit is a self-hosted speech-to-text server designed to transcribe spoken audio in real time with very low delay between when someone speaks and when the text appears. Unlike running a basic transcription model that waits for a full pause before processing, this tool uses research-grade streaming algorithms that process audio incrementally and produce output as speaking continues, not just after a sentence ends.

The project supports speaker identification, meaning it can label who is talking when multiple people are in a conversation. It handles translation between roughly 200 languages through a separate translation component. Voice Activity Detection is built in so the server does not waste processing time when no one is speaking.

Installation is a single pip command. Once running, the server exposes three different API styles: a REST endpoint that matches the OpenAI audio transcription format (so existing code written against OpenAI can point at it instead), a WebSocket endpoint compatible with the Deepgram SDK, and a native WebSocket for real-time streaming. A Chrome browser extension is included for capturing audio from web pages directly.

The tool also works offline for file transcription without starting a server at all. You can feed it an audio or video file and get a plain text transcript or an SRT subtitle file. A model management sub-command lets you download, list, and delete transcription models.

Hardware support covers NVIDIA GPUs with CUDA, Apple Silicon via the MLX framework, and standard CPUs. A second model backend called Voxtral Mini (a 4-billion-parameter model from Mistral AI) is offered as an alternative to Whisper, with better per-chunk language detection across 100-plus languages. The code is Apache 2.0 licensed.
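To make the OpenAI-compatible REST style concrete, here is a minimal sketch using only the Python standard library. It builds the multipart upload that the OpenAI audio transcription format expects (a `model` field and a `file` field) and reads the `text` field of the JSON reply. Several details are assumptions rather than facts from the repo: the base URL `http://localhost:8000`, the `/v1/audio/transcriptions` path (OpenAI's own path, assumed to be mirrored locally), and the `whisper-1` model name (assumed to be accepted or ignored by the local server). Check the project's README for the real values.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(
    audio: bytes,
    filename: str,
    base_url: str = "http://localhost:8000",  # assumed local server address
    model: str = "whisper-1",                 # assumed; OpenAI's model name
) -> urllib.request.Request:
    """Build a multipart/form-data POST in the OpenAI transcription format."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        # Path copied from OpenAI's API; assumed to match the local endpoint.
        url=f"{base_url.rstrip('/')}/v1/audio/transcriptions",
        data=head + audio + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path: str, **kwargs) -> str:
    """Upload an audio file and return the transcript text from the JSON reply."""
    with open(path, "rb") as f:
        request = build_transcription_request(
            f.read(), os.path.basename(path), **kwargs
        )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

With a server running, `transcribe("meeting.wav")` would return the transcript as a string. Code that already uses the official OpenAI client can usually be redirected by changing only its base URL, which is the point of the compatible format.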

Copy-paste prompts

Prompt 1
Install WhisperLiveKit and start the server with CUDA enabled. Show me how to send an audio file to the OpenAI-compatible REST endpoint and get back a transcript in Python.
Prompt 2
Using WhisperLiveKit, transcribe a local .mp4 video file to an SRT subtitle file without starting the server, just the one-off CLI command.
Prompt 3
I want to use the WhisperLiveKit Chrome extension to caption audio from a browser tab in real time. Walk me through installing the extension and connecting it to my local server.
Prompt 4
How do I enable speaker diarization in WhisperLiveKit so the transcript labels each speaker separately in a multi-person recording?

Verify against the repo before relying on details.