explaingit

fishaudio/fish-speech

📈 Trending | 30,400 | Python | Audience · developer | Complexity · 3/5 | Active | License | Setup · moderate

TLDR

Open-source text-to-speech system that converts written text into natural-sounding spoken audio across 80+ languages, with fine-grained emotional control via inline tags.

Mindmap

mindmap
  root((Fish Speech))
    What it does
      Text to speech
      80+ languages
      Emotional control
    How it works
      S2 Pro model
      Two-stage architecture
      Fine acoustic details
    Use cases
      Audiobook narration
      Voice assistants
      Interactive storytelling
    Tech stack
      Python
      HuggingFace models
      CLI and API
    Key features
      Emotion tags
      Voice cloning
      Web interface

Things people build with this

USE CASE 1

Generate audiobook narration with emotional expression by inserting emotion tags like [whisper] or [excited] into your text.

USE CASE 2

Build a voice assistant or interactive chatbot that speaks responses naturally across multiple languages.

USE CASE 3

Create voice-cloned content at scale by converting large batches of text to speech programmatically via the API.
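A minimal sketch of what such a batch loop could look like. It does not call Fish Speech itself: `synthesize` is a hypothetical callable that you would replace with whatever client you use for the project's API or CLI, and the `clip_NNNN.wav` naming scheme is likewise an assumption made for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def batch_synthesize(
    texts: Iterable[str],
    synthesize: Callable[[str], bytes],
    out_dir: Path,
) -> List[Path]:
    """Write one audio file per non-empty text and return the written paths.

    `synthesize` is a placeholder (hypothetical): it takes the text and
    returns encoded audio bytes. Swap in a real Fish Speech call here.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, text in enumerate(texts):
        text = text.strip()
        if not text:
            continue  # skip blank entries instead of sending empty requests
        audio = synthesize(text)
        # The file index follows the position in the input, so skipped
        # entries leave a gap in the numbering.
        path = out_dir / f"clip_{index:04d}.wav"
        path.write_bytes(audio)
        written.append(path)
    return written
```

Passing the synthesizer in as a parameter keeps the loop independent of how the audio is actually produced, so the same function can be exercised with a stub before a real server is available.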

Tech stack

Python · HuggingFace · PyTorch

Getting it running

Difficulty · moderate | Time to first run · 30 min

PyTorch installation and model downloads from HuggingFace can take 10-15 minutes depending on internet speed, and the model weights need free disk space.

Non-commercial use only; check the license terms before using in a commercial product.

In plain English

Fish Speech is an open-source text-to-speech system, meaning software that converts written text into spoken audio. Its focus is on producing speech that sounds natural and expressive, not robotic, across more than 80 languages.

The system works using a model called S2 Pro, which has a two-stage architecture. A larger component (described as the "slow" part) reads the text and determines the overall meaning and timing of what is being said. A smaller, faster component then fills in the fine acoustic details that make the voice sound realistic. Together they produce audio that scores highly on benchmarks measuring how close AI-generated speech sounds to a real human speaker.

A key feature is fine-grained emotional control: you can insert short tags directly into the text, such as [whisper], [excited], or [laughing], at any point, and the model adjusts how those words are spoken accordingly. This makes it suitable for applications like audiobook narration, voice assistants, or interactive storytelling where tone and emotion matter.

You would use this if you need to generate realistic spoken audio from text programmatically, for example, building a voice interface, generating audio content at scale, or experimenting with voice cloning. It can be run from a command line, through a web interface, or via a server API. The tech stack is Python, and the model weights are published on HuggingFace. The license restricts usage to non-commercial purposes; check the terms before using in a product.
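To make the inline tag idea concrete, here is a small sketch that previews which tag would govern each stretch of a script before it is sent for synthesis. It is not part of Fish Speech: the regular expression, the `split_by_tags` helper, and the rule that a tag applies to the text following it until the next tag are all assumptions chosen for illustration, so check the repo for the tag syntax and scoping the model actually uses.

```python
import re
from typing import List, Optional, Tuple

# Matches a bracketed lowercase tag such as [whisper] or [excited].
# The exact tag grammar accepted by the model is an assumption.
TAG_PATTERN = re.compile(r"\[([a-z][a-z _-]*)\]")


def split_by_tags(text: str) -> List[Tuple[Optional[str], str]]:
    """Split text into (tag, chunk) pairs.

    Assumption: a tag applies to the text after it, up to the next tag.
    Text before the first tag is paired with None.
    """
    segments: List[Tuple[Optional[str], str]] = []
    current_tag: Optional[str] = None
    position = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[position:match.start()].strip()
        if chunk:
            segments.append((current_tag, chunk))
        current_tag = match.group(1)
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


if __name__ == "__main__":
    script = "Hello there. [whisper] Keep quiet. [excited] We won!"
    for tag, chunk in split_by_tags(script):
        print(f"{tag or 'neutral'}: {chunk}")
```

A preview like this can help catch misspelled or misplaced tags in a long script before spending time on synthesis.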

Copy-paste prompts

Prompt 1
How do I use Fish Speech to convert a text file to audio with emotional tags like [whisper] and [excited]?
Prompt 2
Show me how to set up Fish Speech's S2 Pro model locally and generate speech in different languages.
Prompt 3
How can I integrate Fish Speech into a Python application to generate voice output from user input?
Prompt 4
What's the difference between the slow and fast components in Fish Speech's two-stage architecture, and how do I control them?

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.