sparkaudio/spark-tts

★ 10,984 | Python | Audience · researcher | Complexity · 4/5 | License · Apache 2.0 | Setup · hard

Mindmap

mindmap
  root((Spark-TTS))
    What it does
      Text to speech
      Voice cloning
      Bilingual output
    Voice controls
      Pitch adjustment
      Speed control
      Gender selection
    Tech stack
      Python
      Gradio
      Hugging Face
    Deployment
      Command line
      Web browser
      Triton server


Things people build with this

USE CASE 1

Clone a person's voice from a short audio sample and generate new speech in that voice for any text you provide.

USE CASE 2

Create synthetic voiceovers in Chinese or English with controllable pitch, speaking speed, and gender characteristics.

USE CASE 3

Deploy a text-to-speech service at higher volumes using the Triton server option for production workloads.

USE CASE 4

Run a local voice cloning demo in a browser interface with Gradio to experiment with different voice parameters.

Tech stack

Python | Gradio | Hugging Face | Triton

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires downloading a 0.5B-parameter model from Hugging Face; a GPU is needed for practical inference speed.
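The download step could look roughly like the sketch below. It uses `snapshot_download` from the `huggingface_hub` package; the repository id `SparkAudio/Spark-TTS-0.5B` and the `pretrained_models/` target folder are assumptions that should be checked against the project README and the model page before use.

```python
from pathlib import Path

# Assumption: the model repository id on Hugging Face; confirm it on the model page.
REPO_ID = "SparkAudio/Spark-TTS-0.5B"


def download_model(local_dir: str = "pretrained_models/Spark-TTS-0.5B") -> Path:
    """Download all files of the model repository into local_dir and return that path."""
    # Imported lazily so this module still loads before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Fetches the full repository snapshot (weights, config, tokenizer files).
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_model()` once from the repository root should leave the weights in the assumed folder; the download is large, so expect it to take a while on a slow connection.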

Free to use and modify for any purpose including commercial under the Apache 2.0 license.

In plain English

Spark-TTS is a text-to-speech system that converts written text into spoken audio. It was developed by researchers at several universities and published alongside a research paper. The model is built on top of a large language model (an AI system trained on large amounts of text), which it uses to produce natural-sounding speech instead of relying on older rule-based or simpler statistical approaches.

One of its main capabilities is voice cloning without prior training on the target voice. You give it a short audio clip of someone speaking, and it generates new speech that sounds like that person saying whatever text you provide. It supports both Chinese and English and can switch between the two languages within the same output.

You can also control certain properties of the generated voice: whether the speaker sounds male or female, how high or low the pitch is, and how fast they speak. This makes it possible to create entirely new virtual voices with adjustable characteristics, not just clone existing ones.

The code is written in Python and runs on Linux (a separate community guide covers Windows). Setup involves downloading a 0.5-billion-parameter model from Hugging Face, installing the required Python packages, and then running inference either from the command line or through a web browser interface built with Gradio. A server deployment option using Nvidia's Triton software is also included for teams that need to run the system at higher volumes.

The project is released under the Apache 2.0 license, and the model weights are hosted on Hugging Face.
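As a rough illustration of the command-line route, the sketch below assembles an argument list for the two modes described above: cloning from a reference clip, or creating a voice from gender, pitch, and speed settings. The module name `cli.inference`, every flag name, and the level values such as `moderate` are assumptions based on the project README and may not match the current code; `build_inference_command` itself is a hypothetical helper, not part of Spark-TTS. Treating the two modes as mutually exclusive is a simplification made here.

```python
import sys
from typing import List, Optional


def build_inference_command(
    text: str,
    save_dir: str = "results",
    model_dir: str = "pretrained_models/Spark-TTS-0.5B",
    device: int = 0,
    prompt_speech_path: Optional[str] = None,
    prompt_text: Optional[str] = None,
    gender: Optional[str] = None,
    pitch: Optional[str] = None,
    speed: Optional[str] = None,
) -> List[str]:
    """Build an argv list for a Spark-TTS inference run (flag names are assumed)."""
    cloning = prompt_speech_path is not None
    controlled = gender is not None
    if cloning == controlled:
        raise ValueError("Pass either a reference clip or voice controls, not both or neither.")

    cmd = [
        sys.executable, "-m", "cli.inference",  # assumed entry point
        "--text", text,
        "--device", str(device),
        "--save_dir", save_dir,
        "--model_dir", model_dir,
    ]
    if cloning:
        # Voice cloning: a short reference recording, optionally with its transcript.
        cmd += ["--prompt_speech_path", prompt_speech_path]
        if prompt_text:
            cmd += ["--prompt_text", prompt_text]
    else:
        # Voice creation: this sketch requires all three controls to be set.
        if pitch is None or speed is None:
            raise ValueError("Voice controls need gender, pitch, and speed.")
        cmd += ["--gender", gender, "--pitch", pitch, "--speed", speed]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, for example `build_inference_command("Hello there", prompt_speech_path="ref.wav")` for cloning or `build_inference_command("Hello there", gender="female", pitch="moderate", speed="moderate")` for a created voice.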

Copy-paste prompts

Prompt 1

I've set up Spark-TTS on Linux. Show me the command to clone a voice from a 10-second WAV file and generate speech from a text string.

Prompt 2

How do I adjust the pitch and speaking speed when generating text-to-speech output with Spark-TTS from the command line?

Prompt 3

Show me how to run Spark-TTS as a Gradio web app so I can test voice cloning through a browser interface.

Prompt 4

Walk me through deploying Spark-TTS on an Nvidia Triton server for production use with multiple concurrent requests.

Prompt 5

How do I download the Spark-TTS 0.5B model from Hugging Face and run my first voice cloning generation?


Verify against the repo before relying on details.