Clone a person's voice from a short audio sample and generate new speech in that voice for any text you provide.
Create synthetic voiceovers in Chinese or English with controllable pitch, speaking speed, and gender characteristics.
Deploy a text-to-speech service at higher volumes using the Triton server option for production workloads.
Run a local voice cloning demo in a browser interface with Gradio to experiment with different voice parameters.
Requires downloading a 0.5B parameter model from Hugging Face and a GPU for practical inference performance.
Spark-TTS is a text-to-speech system that converts written text into spoken audio. It was developed by researchers at several universities and published alongside a research paper. The model is built on top of a large language model (an AI system trained on large amounts of text), which it uses to produce natural-sounding speech rather than relying on older rule-based or simpler statistical approaches.

One of the main things Spark-TTS can do is clone a voice without prior training on that specific voice. You give it a short audio clip of someone speaking, and it can then generate new speech that sounds like that person saying whatever text you provide. It supports both Chinese and English, and can switch between languages within the same output. You can also control certain properties of the generated voice, such as whether the speaker sounds male or female, how high or low the pitch is, and how fast they speak. This makes it possible to create entirely new virtual voices with adjustable characteristics, not just clone existing ones.

The code is written in Python and runs on Linux (with a separate community guide available for Windows). Setup involves downloading a 0.5 billion parameter model file from Hugging Face, installing the required Python packages, and then running inference either from the command line or through a web browser interface built with Gradio. A server deployment option using Nvidia's Triton software is also included for teams that need to run the system at higher volumes.

The project is licensed under Apache 2.0, and the model weights are available for download on Hugging Face.
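The model download step in the setup described above can be sketched in Python. This is a minimal sketch, not the project's own setup script: it assumes the `huggingface_hub` package is installed, that the weights live under the repo id `SparkAudio/Spark-TTS-0.5B`, and that a `pretrained_models` folder is a sensible local target. None of these names are stated in this summary, and `local_model_dir` and `download_model` are hypothetical helpers, so check them against the repo.

```python
from pathlib import Path

# Assumption: this repo id is not given in the summary above; confirm it
# against the project's README before relying on it.
REPO_ID = "SparkAudio/Spark-TTS-0.5B"


def local_model_dir(repo_id: str, root: str = "pretrained_models") -> Path:
    """Map a Hugging Face repo id to a local folder named after the model."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id: str = REPO_ID, root: str = "pretrained_models") -> Path:
    """Download every file in the model repo into the local folder."""
    # Imported here so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_model()` once fetches the weights over the network and returns the local folder, which can then be passed to whichever inference entry point the repo provides.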
Verify against the repo before relying on details.