Build a live captioning tool for video calls or live streams that shows words as people speak.
Create SRT subtitle files from audio or video recordings without sending data to a cloud service.
Replace OpenAI's transcription API with a local, private alternative using the same REST format.
Add multi-speaker labeling to a meeting recorder so the transcript shows who said what.
A GPU is needed for real-time performance; CPU mode works but is significantly slower.
WhisperLiveKit is a self-hosted speech-to-text server designed to transcribe spoken audio in real time with very low delay between when someone speaks and when the text appears. Unlike running a basic transcription model that waits for a full pause before processing, this tool uses research-grade streaming algorithms that process audio incrementally and produce output as speaking continues, not just after a sentence ends.

The project supports speaker identification, meaning it can label who is talking when multiple people are in a conversation. It handles translation between roughly 200 languages through a separate translation component. Voice Activity Detection is built in so the server does not waste processing time when no one is speaking.

Installation is a single pip command. Once running, the server exposes three different API styles: a REST endpoint that matches the OpenAI audio transcription format (so existing code written against OpenAI can point at it instead), a WebSocket endpoint compatible with the Deepgram SDK, and a native WebSocket for real-time streaming. A Chrome browser extension is included for capturing audio from web pages directly.

The tool also works offline for file transcription without starting a server at all. You can feed it an audio or video file and get a plain text transcript or an SRT subtitle file. A model management sub-command lets you download, list, and delete transcription models.

Hardware support covers NVIDIA GPUs with CUDA, Apple Silicon via the MLX framework, and standard CPUs. A second model backend called Voxtral Mini (a 4-billion-parameter model from Mistral AI) is offered as an alternative to Whisper, with better per-chunk language detection across 100-plus languages. The code is Apache 2.0 licensed.
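For readers unfamiliar with the SRT subtitle format mentioned above, the sketch below shows how timed transcript segments map onto SRT cues. It illustrates the file format only and is not taken from WhisperLiveKit: the `Segment` class and the `srt_timestamp` and `to_srt` helpers are hypothetical names introduced here, and the tool's own data structures may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed span. Hypothetical shape, not the tool's own type."""
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as SRT cues: index, time range, text, blank separator."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # Joining with a newline leaves one blank line between consecutive cues.
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [
        Segment(0.0, 1.5, "Hello everyone."),
        Segment(1.5, 3.0, "Let's get started."),
    ]
    print(to_srt(demo))
```

Each cue carries a 1-based index, a start and end time, and the text, which is why a transcript needs per-segment timing before it can be written out as subtitles.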
Verify against the repo before relying on details.