xiaomi-research/dasheng-audiogen

★ 23
Audience · developer
Complexity · 4/5
License · Apache 2.0
Setup · hard

Mindmap

mindmap
  root((repo))
    What it does
      Text to audio
      Mixed audio scenes
    Prompt structure
      Caption overall scene
      Speech and dialogue
      Music and SFX
      Environmental ambience
    Models
      English-only version
      Multilingual version
    Tech
      Python
      PyTorch
      Hugging Face


Things people build with this

USE CASE 1

Generate a film scene audio track combining narration, jazz music, and rain ambience from a single text description.

USE CASE 2

Create sound effects and background music for a game or app from written prompts without hiring audio designers.

USE CASE 3

Batch-generate multiple audio scenes from a list of text descriptions in one model call.

Tech stack

Python · PyTorch · Hugging Face Transformers

Getting it running

Difficulty · hard
Time to first run · 1h+

Requires PyTorch and a GPU for reasonable inference speed. The multilingual model has notably higher error rates for non-English languages.

Use freely for any purpose, including commercial use, as long as you preserve the license and copyright notice.

In plain English

Dasheng-AudioGen is an AI model that generates audio from text descriptions. Unlike models that produce only speech or only music, it can produce a mix of audio types at once: spoken dialogue, background music, sound effects, and environmental sounds, all combined into a single output file. The goal is to generate coherent audio scenes from a written description rather than isolated audio clips.

You describe what you want using a structured set of labeled sections. Every prompt must start with a caption that describes the overall scene. From there, you can optionally add a speech section describing the speaker's voice, an asr section containing the actual words to be spoken, a music section describing the background music, an sfx section for sound effects, and an env section for environmental ambience. The model reads all of these sections together and generates audio that combines them. An example from the README describes a gritty detective narrating over heavy rain and a melancholic jazz saxophone, with the spoken line produced in a deep male voice.

The model is available through Hugging Face and runs in a Python environment. Installation requires a few packages, including PyTorch and the Hugging Face Transformers library. After loading the model, you call its compose_prompt method with your labeled sections and pass the result to a generate method that returns audio data, which you then save as a WAV file. Batch inference is supported, so you can generate several different audio scenes in one call.

Two model versions exist: an English-only version and a multilingual version. The README notes that the multilingual version has notably higher error rates for languages other than English, so the base model is recommended for English use cases.

This is a research project developed by Xiaomi Research and SJTU X-LANCE, with an associated paper on arXiv. The code is released under Apache 2.0.
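The labeled sections described above can be sketched as a small data structure. This is only an illustration of the rules stated here (caption required, the other five sections optional), not the library's own code: the ScenePrompt class, its sections and render methods, the "[label] text" layout, and the example sentences are all hypothetical, and the sketch deliberately does not call the model's real compose_prompt or generate methods, whose exact signatures are not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional

# Section labels named in the description above. The "[label] text" rendering
# below is purely illustrative and is NOT the model's real prompt format.
SECTION_ORDER = ("caption", "speech", "asr", "music", "sfx", "env")


@dataclass
class ScenePrompt:
    """Hypothetical container for one audio scene description."""
    caption: str                   # required: overall scene
    speech: Optional[str] = None   # description of the speaker's voice
    asr: Optional[str] = None      # the words to be spoken
    music: Optional[str] = None    # background music
    sfx: Optional[str] = None      # sound effects
    env: Optional[str] = None      # environmental ambience

    def sections(self) -> dict:
        """Return the non-empty sections in a fixed order, caption first."""
        if not self.caption or not self.caption.strip():
            raise ValueError("every prompt must start with a caption")
        out = {}
        for name in SECTION_ORDER:
            value = getattr(self, name)
            if value and value.strip():
                out[name] = value.strip()
        return out

    def render(self) -> str:
        """Illustrative plain-text view, one labeled section per line."""
        return "\n".join(f"[{k}] {v}" for k, v in self.sections().items())


# Placeholder wording loosely based on the detective scene; not README text.
detective = ScenePrompt(
    caption="A gritty detective narrates over heavy rain and jazz saxophone.",
    speech="Deep male voice.",
    asr="Placeholder line for the narrator to speak.",
    music="Melancholic jazz saxophone.",
    env="Heavy rain.",
)
cafe = ScenePrompt(caption="A busy coffee shop.", sfx="Espresso machine.")

# A batch is simply a list of scenes, one entry per clip to generate.
batch = [scene.sections() for scene in (detective, cafe)]
print(detective.render())
```

In real use, each dictionary in batch would presumably be handed to the model's own compose_prompt method instead of render; check the README for the actual argument names before relying on this shape.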

Copy-paste prompts

Prompt 1

Using dasheng-audiogen, write Python code to generate audio of a detective narrating over rain and jazz saxophone. Use the caption, speech, asr, music, and env tags in compose_prompt.

Prompt 2

I have dasheng-audiogen set up. How do I batch-generate 5 different audio scenes from a list of text descriptions and save each as a separate WAV file?

Prompt 3

Write a dasheng-audiogen prompt for a coffee shop scene: background chatter, espresso machine sounds, light jazz, and a barista speaking one sentence.


Verify against the repo before relying on details.