explaingit

jamiepine/voicebox

26,710 | TypeScript | Audience · vibe coder | Complexity · 3/5 | Active | License | Setup · hard

TLDR

A free desktop app that clones voices, generates speech in 23 languages, and transcribes dictation, all locally, no cloud.

Mindmap

mindmap
  root((Voicebox))
    Voice Output
      Clone any voice
      Text to speech
      23 languages
      7 TTS engines
    Voice Input
      Global hotkey
      Whisper transcription
      Auto-paste anywhere
    Creative Features
      Expressive tags
      Audio effects
      Multi-speaker podcasts
      Timeline editor
    Tech Stack
      Tauri framework
      TypeScript
      Local processing
      MCP server
    Platforms
      macOS
      Windows
      Linux
      Docker

Things people build with this

USE CASE 1

Clone your voice and generate audiobook narrations or podcast intros without hiring voice actors.

USE CASE 2

Dictate notes, emails, and code comments hands-free using a global hotkey in any app.

USE CASE 3

Build AI agents in Claude or Cursor that speak responses aloud in custom cloned voices.

USE CASE 4

Create multi-speaker podcast episodes with different voices in a visual timeline editor.

Tech stack

TypeScript, Tauri, Rust, Whisper, Qwen3-TTS, Kokoro

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires downloading large ML models (Whisper, Qwen3-TTS, Kokoro) and building the Tauri desktop app with its Rust dependencies.

Free and open-source; you can use, modify, and distribute it freely.

In plain English

Voicebox is a free, open-source desktop application that serves as a complete local AI voice studio: you can clone voices, generate speech, and dictate into any app without sending data to the cloud. It positions itself as a combined local alternative to ElevenLabs (for voice output) and WisprFlow (for voice input).

On the output side, you can clone any voice from a short audio sample and use it to convert text to speech in 23 languages. You choose from seven text-to-speech engines, including Qwen3-TTS, Chatterbox, Kokoro, and HumeAI TADA, each with different strengths in quality, speed, and language coverage. You can add expressive tags like [laugh] or [sigh] to control delivery, apply audio effects like reverb or pitch shift, and generate multi-speaker podcast-style conversations in a visual timeline editor.

On the input side, a global keyboard hotkey activates voice dictation anywhere on your computer using Whisper-based speech recognition, then automatically pastes the transcribed text into whatever field you are typing in.

For AI power users, Voicebox exposes an API and a built-in MCP server (MCP is a standard for connecting AI tools), so agents running in tools like Claude Code or Cursor can call a single command to speak responses aloud in a cloned voice.

All processing happens locally; nothing leaves your machine. Voicebox runs on macOS, Windows, Linux, and Docker, and is built with Tauri (a Rust-based framework for native desktop apps) with a TypeScript interface.
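To make the idea of expressive tags like [laugh] or [sigh] more concrete, here is a minimal TypeScript sketch of how inline tags could be separated from the spoken text before synthesis. This is not Voicebox's code or API: the `Segment` type and the `splitExpressiveTags` function are hypothetical names invented for illustration, and the assumption that tags are lowercase words in square brackets comes only from the two examples above.

```typescript
// Hypothetical illustration only; not taken from the Voicebox codebase.
// Assumes tags are lowercase words in square brackets, e.g. [laugh].
type Segment =
  | { kind: "text"; value: string }
  | { kind: "tag"; value: string };

function splitExpressiveTags(input: string): Segment[] {
  const segments: Segment[] = [];
  const tagPattern = /\[([a-z]+)\]/g;
  let cursor = 0;
  let match: RegExpExecArray | null;

  // Walk through every tag and emit the text that precedes it.
  while ((match = tagPattern.exec(input)) !== null) {
    const before = input.slice(cursor, match.index).trim();
    if (before.length > 0) {
      segments.push({ kind: "text", value: before });
    }
    segments.push({ kind: "tag", value: match[1] });
    cursor = match.index + match[0].length;
  }

  // Emit any text left after the last tag.
  const rest = input.slice(cursor).trim();
  if (rest.length > 0) {
    segments.push({ kind: "text", value: rest });
  }
  return segments;
}
```

Calling `splitExpressiveTags("Well [laugh] that went badly [sigh]")` yields four segments: the text "Well", the tag "laugh", the text "that went badly", and the tag "sigh". How each engine actually interprets such tags is something to verify against the repo.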

Copy-paste prompts

Prompt 1
How do I clone a voice in Voicebox and use it to generate speech in multiple languages?
Prompt 2
Show me how to set up the global voice dictation hotkey so I can transcribe text into any app.
Prompt 3
How can I integrate Voicebox's API or MCP server so my Claude Code agent can speak responses aloud?
Prompt 4
What are the differences between the seven TTS engines in Voicebox, and which one should I use for podcast quality?
Prompt 5
How do I create a multi-speaker podcast conversation using Voicebox's timeline editor?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.