explaingit

fishaudio/fish-speech

Analysis updated 2026-06-20

30,099 stars · Python · Audience: developer · Complexity: 3/5 · Setup: moderate

TLDR

Fish Speech is an open-source text-to-speech system that converts written text into realistic spoken audio across 80+ languages, with fine-grained emotional control using inline tags like [whisper] or [excited] inserted directly into the text.

Mindmap

mindmap
  root((fish-speech))
    What it does
      Text to speech
      80 plus languages
      Emotional control
      Voice cloning
    Architecture
      Slow meaning model
      Fast acoustic model
      Two-stage pipeline
    Use cases
      Audiobook narration
      Voice assistants
      Content generation
    Interfaces
      Command line
      Web UI
      Server API
    Audience
      Developers
      Content creators

What do people build with it?

USE CASE 1

Generate realistic voiceover audio for an audiobook or narration project with emotional tone control built into the script.

USE CASE 2

Build a voice interface or assistant that speaks back to users with natural-sounding, expressive speech in multiple languages.

USE CASE 3

Automate audio content creation at scale by sending text to the server API and receiving spoken audio files in return.
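A minimal sketch of the client side of this workflow, using only the Python standard library. The server address, the `/v1/tts` route, and the `text` and `format` payload fields are assumptions for illustration and are not confirmed by this page; check the repo's API documentation for the actual route and schema before relying on them.

```python
import json
import urllib.request

# Hypothetical server address and route; adjust to match the real server.
DEFAULT_URL = "http://127.0.0.1:8080/v1/tts"


def build_tts_request(text: str, audio_format: str = "wav",
                      url: str = DEFAULT_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the text to speak."""
    payload = {"text": text, "format": audio_format}  # assumed field names
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str, url: str = DEFAULT_URL) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_tts_request(text, url=url)
    with urllib.request.urlopen(request) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio_bytes)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a running server. For batch jobs, `synthesize_to_file` could be called once per text item in a loop.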

USE CASE 4

Experiment with voice generation and emotional expression by inserting tags like [laughing] or [whisper] into sample text via the web interface.
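As a rough illustration, tagged sample text can be assembled in Python before pasting it into the web interface. This sketch assumes a tag is written in square brackets immediately before the words it should affect; the helper names `tagged` and `build_script` are hypothetical and not part of Fish Speech.

```python
def tagged(tag: str, text: str) -> str:
    """Prefix text with an inline tag such as [whisper]."""
    return f"[{tag}] {text}"


def build_script(lines):
    """Join (tag, text) pairs into one script; a tag of None leaves the line plain."""
    parts = [tagged(tag, text) if tag else text for tag, text in lines]
    return " ".join(parts)


script = build_script([
    (None, "The door creaked open."),
    ("whisper", "Is anyone there?"),
    ("excited", "It was a surprise party!"),
])
print(script)
# The door creaked open. [whisper] Is anyone there? [excited] It was a surprise party!
```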

What is it built with?

Python

How does it compare?

Repo               fishaudio/fish-speech   encode/django-rest-framework   trailofbits/algo
Stars              30,099                  29,996                         30,216
Language           Python                  Python                         Python
Setup difficulty   moderate                moderate                       moderate
Complexity         3/5                     3/5                            3/5
Audience           developer               developer                      ops devops

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty: moderate · Time to first run: 1h+

Requires downloading the model weights from HuggingFace. A GPU is recommended for acceptable generation speed. The non-commercial license restricts production use.
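The weight download step might look like the sketch below, which uses `snapshot_download` from the `huggingface_hub` package. The repository id `fishaudio/s2-pro` and the `checkpoints` folder layout are assumptions made for illustration; confirm the actual model id and expected directory in the repo's setup instructions.

```python
from pathlib import Path

# Assumed repository id for the S2 Pro weights; confirm it on HuggingFace.
ASSUMED_REPO_ID = "fishaudio/s2-pro"


def checkpoint_dir(repo_id: str, root: str = "checkpoints") -> Path:
    """Return the local folder where the weights for repo_id would be stored."""
    return Path(root) / repo_id.split("/")[-1]


def download_weights(repo_id: str = ASSUMED_REPO_ID, root: str = "checkpoints") -> Path:
    """Download a model snapshot from HuggingFace into a local folder."""
    # Imported lazily so checkpoint_dir works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = checkpoint_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_weights()` requires network access and `pip install huggingface_hub`; the download can be large, which is part of why the first run is estimated at an hour or more.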

Non-commercial use only: you cannot use this in a product or service that makes money without checking and complying with the specific license terms.

In plain English

Fish Speech is an open-source text-to-speech system, meaning software that converts written text into spoken audio. Its focus is on producing speech that sounds natural and expressive, not robotic, across more than 80 languages.

The system uses a model called S2 Pro, which has a two-stage architecture. A larger component (described as the "slow" part) reads the text and determines the overall meaning and timing of what is being said. A smaller, faster component then fills in the fine acoustic details that make the voice sound realistic. Together they produce audio that scores highly on benchmarks measuring how close AI-generated speech sounds to a real human speaker.

A key feature is fine-grained emotional control: you can insert short tags such as [whisper], [excited], or [laughing] directly into the text at any point, and the model adjusts how those words are spoken accordingly. This makes it suitable for applications like audiobook narration, voice assistants, or interactive storytelling where tone and emotion matter.

You would use this if you need to generate realistic spoken audio from text programmatically, for example when building a voice interface, generating audio content at scale, or experimenting with voice cloning. It can be run from a command line, through a web interface, or via a server API. The tech stack is Python, and the model weights are published on HuggingFace. The license restricts usage to non-commercial purposes, so check the terms before using it in a product.
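To make the inline tags concrete, the sketch below splits a tagged script into (tag, segment) pairs so you can preview which tag precedes which words before generating audio. It is a client-side helper only and says nothing about how the model itself interprets tags. It assumes tags are lowercase words in square brackets and, as a simplification, treats a tag as applying until the next tag; `split_by_tags` is a hypothetical name.

```python
import re

# Matches an inline tag such as [whisper]; the allowed characters are an assumption.
TAG_PATTERN = re.compile(r"\[([a-z_ ]+)\]")


def split_by_tags(text: str):
    """Split text into (tag, segment) pairs; text before any tag gets tag None."""
    segments = []
    current_tag = None
    position = 0
    for match in TAG_PATTERN.finditer(text):
        # Text between the previous tag and this one belongs to the previous tag.
        chunk = text[position:match.start()].strip()
        if chunk:
            segments.append((current_tag, chunk))
        current_tag = match.group(1)
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


for tag, segment in split_by_tags("Hello there. [whisper] Keep quiet. [excited] We won!"):
    print(tag, "->", segment)
```

A preview like this can help catch a misplaced or misspelled tag in a long narration script before spending GPU time on generation.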

Copy-paste prompts

Prompt 1
I want to generate a short audiobook chapter using Fish Speech with some lines whispered and others excited. Show me how to write the input text with [whisper] and [excited] tags and call the API to produce the audio file.
Prompt 2
Walk me through installing Fish Speech locally in Python, downloading the S2 Pro model weights from HuggingFace, and running a basic text-to-speech conversion from the command line.
Prompt 3
I want to call the Fish Speech server API from my Python app to convert a paragraph of text to speech and save it as an MP3. Write me the code to do that.
Prompt 4
Fish Speech supports 80+ languages. How do I generate speech in Spanish and Japanese from the same Python script, and does the model switch automatically based on the input text?

Frequently asked questions

What is fish-speech?

Fish Speech is an open-source text-to-speech system that converts written text into realistic spoken audio across 80+ languages, with fine-grained emotional control using inline tags like [whisper] or [excited] inserted directly into the text.

What language is fish-speech written in?

Mainly Python.

What license does fish-speech use?

Non-commercial use only: you cannot use this in a product or service that makes money without checking and complying with the specific license terms.

How hard is fish-speech to set up?

Setup difficulty is rated moderate, with an hour or more to a first successful run.

Who is fish-speech for?

Mainly developers.


Verify against the repo before relying on details.