Generate a film scene audio track combining narration, jazz music, and rain ambience from a single text description.
Create sound effects and background music for a game or app from written prompts without hiring audio designers.
Batch-generate multiple audio scenes from a list of text descriptions in one model call.
Requires PyTorch and a GPU for reasonable inference speed. The multilingual model has notably higher error rates for non-English languages.
Dasheng-AudioGen is an AI model that generates audio from text descriptions. Unlike models that produce only speech or only music, this one can produce a mix of different audio types at once: spoken dialogue, background music, sound effects, and environmental sounds, all combined into a single output file. The goal is to generate coherent audio scenes from a written description rather than isolated audio clips.

You describe what you want using a structured set of labeled sections. Every prompt must start with a caption that describes the overall scene. From there, you can optionally add a speech section describing the speaker's voice, an asr section containing the actual words that should be spoken, a music section describing the background music, an sfx section for sound effects, and an env section for environmental ambience. The model reads all of those tags together and generates audio that combines them. An example from the README describes a gritty detective narrating over heavy rain and a melancholic jazz saxophone, with the spoken line produced in a deep male voice.

The model is available through HuggingFace and works within a Python environment. Installation requires a few packages including PyTorch and the Hugging Face Transformers library. After loading the model, you call a compose prompt method with your labeled sections and pass the result to a generate method that returns audio data, which you then save as a WAV file. Batch inference is supported, meaning you can generate several different audio scenes in one call.

Two model versions exist: an English-only version and a multilingual version. The README notes that the multilingual version has notably higher error rates for languages other than English, so the base model is recommended for English use cases.

This is a research project developed by Xiaomi Research and SJTU X-LANCE, with an associated paper on arXiv. The code is released under Apache 2.0.
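The labeled-section prompt described above can be sketched in plain Python. This is only an illustration under stated assumptions: build_scene_prompt is a hypothetical stand-in, not the model's own compose prompt method, and the "<tag> text" line layout is a guess at the format rather than the syntax the model expects. The spoken line and the second scene are placeholders, not text from the README. Only the section names (caption, speech, asr, music, sfx, env) and the rule that a caption is required come from the description.

```python
from typing import Optional

# Section names and their order follow the description above.
# ASSUMPTION: the "<tag> text" layout is illustrative, not the real syntax.
SECTION_ORDER = ("caption", "speech", "asr", "music", "sfx", "env")


def build_scene_prompt(caption: str, **sections: Optional[str]) -> str:
    """Hypothetical helper: join labeled sections into one prompt string."""
    if not caption or not caption.strip():
        raise ValueError("every prompt must start with a caption")
    unknown = set(sections) - set(SECTION_ORDER[1:])
    if unknown:
        raise ValueError(f"unknown sections: {sorted(unknown)}")
    lines = [f"<caption> {caption.strip()}"]
    for name in SECTION_ORDER[1:]:
        text = sections.get(name)
        if text:  # optional sections are simply skipped when absent
            lines.append(f"<{name}> {text.strip()}")
    return "\n".join(lines)


# Scene modeled on the README's detective example; the asr line is a placeholder.
detective = build_scene_prompt(
    "A gritty detective narrates over heavy rain and a melancholic jazz saxophone.",
    speech="Deep male voice, slow and weary.",
    asr="The rain never washes this city clean.",
    music="Melancholic jazz saxophone.",
    env="Heavy rain.",
)

# A batch is just a list of prompts, one per scene (second scene is invented).
batch = [
    detective,
    build_scene_prompt(
        "A lively arcade with chiptune music and button clicks.",
        music="Upbeat chiptune loop.",
        sfx="Rapid button clicks and coin drops.",
    ),
]
```

With the real model, each prompt would instead come from its compose prompt method, and the list would be handed to its generate method in a single call. Check the exact method names and arguments against the repository.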
Verify against the repo before relying on details.