Generate spoken audio from text in 10 languages for a voice assistant, podcast tool, or accessibility feature
Clone a specific speaker's voice from a 3-second audio sample to produce personalized speech output
Describe a voice in plain text (age, gender, accent, emotion) and generate matching speech without a pre-recorded sample
Stream real-time text-to-speech into an application with under 100ms latency to the first audio packet
Requires a capable GPU for local inference; a hosted API is available via Alibaba Cloud for those without GPU hardware.
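As a rough guide to what "capable GPU" might mean, the sketch below estimates the memory needed just to hold the model weights for the two sizes described below (0.6B and 1.7B parameters). The bytes-per-parameter table and the assumption that weights are loaded in a 16-bit format are illustrative, not taken from the repository. Real usage will be higher once activations, the audio codec, and runtime overhead are included.

```python
# Back-of-the-envelope estimate of weight memory only.
# Assumption: parameter counts are exactly 0.6e9 and 1.7e9, and the dtype
# sizes below are the usual ones; the repo's actual checkpoints may differ.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gib(params_billion: float, dtype: str = "bf16") -> float:
    """Return the approximate size of the weights in GiB for a given dtype."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype}")
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[dtype]
    return total_bytes / 2**30


if __name__ == "__main__":
    for size in (0.6, 1.7):
        for dtype in ("bf16", "fp32"):
            gib = weight_memory_gib(size, dtype)
            print(f"{size}B params in {dtype}: about {gib:.2f} GiB of weights")
```

Treat the result as a lower bound when choosing hardware, and verify actual requirements against the repository.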
Qwen3-TTS is a collection of open-source text-to-speech models built by the Qwen team at Alibaba Cloud. The models take written text as input and produce spoken audio as output, covering ten languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. Several regional dialect voice profiles are also included.

The collection ships multiple model variants tuned for different tasks. One variant lets you describe a voice in plain text (age, gender, accent, emotion), and the model generates audio in that style. Another variant clones an existing voice from a short three-second audio sample, so you can reproduce a specific speaker's sound. A third variant offers nine pre-built premium voices with controllable style.

All variants support streaming output, meaning audio can start playing almost immediately rather than waiting for the full clip to render. The README cites a latency of 97 milliseconds from the moment text arrives to the first audio packet being sent out. The underlying architecture avoids the common two-stage design (a language model feeding a separate diffusion model) in favor of a single end-to-end approach, which the team says reduces the errors that can accumulate when two separate systems are chained together.

Two model sizes are available: 0.6B and 1.7B parameters. The smaller model runs faster and needs less hardware, while the larger one generally produces higher-quality or more controllable output. The models can be loaded through the qwen-tts Python package or through vLLM, a popular high-throughput inference server. Fine-tuning on custom data is also supported for teams that need a specialized voice style. A hosted API is available via Alibaba Cloud for those who do not want to run the models locally.

The repository includes a local web demo, code examples for each major use case, and links to model weights on Hugging Face and ModelScope.
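To check a first-packet latency figure like the 97 ms cited above on your own setup, one option is to time how long a streaming call takes to yield its first audio chunk. The sketch below is library-agnostic: `fake_tts_stream` is a hypothetical stand-in that yields silent PCM chunks, not the qwen-tts or vLLM API, and the 24 kHz sample rate and 20 ms chunk size are assumptions. In practice you would pass `time_to_first_chunk` whatever iterator of audio chunks your actual client returns.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(text: str, chunk_ms: int = 20) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call.

    Yields one chunk of silent 16-bit mono PCM per word of input.
    The sample rate is an assumed value chosen for illustration.
    """
    sample_rate = 24_000
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2  # 2 bytes per sample
    for _ in text.split():
        time.sleep(0.005)  # simulate per-chunk generation time
        yield b"\x00" * bytes_per_chunk


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, int]:
    """Consume a stream; return (seconds until first chunk, total bytes)."""
    start = time.perf_counter()
    first_latency = None
    total_bytes = 0
    for chunk in stream:
        if first_latency is None:
            first_latency = time.perf_counter() - start
        total_bytes += len(chunk)
    if first_latency is None:
        raise ValueError("stream produced no audio")
    return first_latency, total_bytes


if __name__ == "__main__":
    latency, total = time_to_first_chunk(fake_tts_stream("hello streaming world"))
    print(f"first chunk after {latency * 1000:.1f} ms, {total} bytes total")
```

The timer starts before the first chunk is requested, so the measurement includes any setup the generator performs before yielding. Network transfer and audio playback buffering are not covered and would add to the latency a listener perceives.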
Verify against the repo before relying on details.