Add a realistic soundtrack to a silent video clip by describing the sounds, speech, and music you want in a text prompt.
Generate background music and ambient sound effects for short video content without recording real audio.
Research and benchmark AI models that automatically sync audio to video for academic or experimental purposes.
Produce voiceover speech combined with background music for a video scene using a single AI inference run.
Requires Python 3.10, CUDA 12.4, PyTorch 2.6, and FlashAttention. Several large model weights must be downloaded from Hugging Face (the Foley-Omni checkpoint, the Wan2.2 text encoder, and MMAudio components). A GPU with ample VRAM is strongly recommended.
Foley-Omni is a Python research project from Nanjing University that generates audio for silent or muted videos using an AI model. Given a video clip and a text description, the model produces a complete soundtrack containing speech, sound effects, and background music together, all synchronized with what is happening on screen. This task, sometimes called video-to-soundtrack generation, is the main focus of the project.
The text prompt fed to the model uses a structured format with three optional blocks. A WORDS block gives the speech to be spoken. An AUDIO_CAPTION block describes ambient sounds, events, and speaker characteristics. A MUSIC block specifies music style, mood, instruments, and tempo. You can include any combination of the three, so you can generate only sound effects, only music, only speech, or all three at once. The model also supports text-only generation without any video input. The current public checkpoint is designed for videos up to 10 seconds long.
Running inference involves setting up a YAML config file that points to input videos and their prompt data, then running a Python inference script. The output is an MP4 file with the generated audio merged in. A batch mode accepts a JSON manifest listing multiple videos. Visual features can be pre-extracted to speed up repeated inference on the same footage.
Installation requires Python 3.10, CUDA 12.4, PyTorch 2.6, and FlashAttention. Model weights are downloaded from Hugging Face and consist of several components: the Foley-Omni checkpoint itself, a text encoder from the Wan2.2 video model, and pre-trained audio components from MMAudio. The total download is substantial.
This is a research code release accompanying an arXiv paper. A benchmark dataset (V2ST-Bench) and a Hugging Face demo are listed as coming soon. No license is stated in the README.
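The three-block prompt and the batch manifest described above can be illustrated with a small sketch. It is not taken from the Foley-Omni code: only the block names WORDS, AUDIO_CAPTION, and MUSIC come from the summary, while the angle-bracket delimiters, the helper names `build_prompt` and `build_manifest`, and the manifest keys `video` and `prompt` are assumptions made for illustration. Check the repository's README and example configs for the real syntax before using anything like this.

```python
import json

# Block names come from the project description; the <NAME>...</NAME>
# delimiter syntax is an ASSUMPTION, not the confirmed Foley-Omni format.
BLOCK_ORDER = ("WORDS", "AUDIO_CAPTION", "MUSIC")


def build_prompt(words=None, audio_caption=None, music=None):
    """Join whichever of the three optional blocks were supplied."""
    values = dict(zip(BLOCK_ORDER, (words, audio_caption, music)))
    parts = [
        f"<{name}>{text.strip()}</{name}>"
        for name, text in values.items()
        if text and text.strip()
    ]
    if not parts:
        raise ValueError("supply at least one of words, audio_caption, music")
    return "\n".join(parts)


def build_manifest(entries):
    """Turn (video_path, prompt) pairs into a list of manifest records.

    The keys "video" and "prompt" are hypothetical. A video_path of None
    stands in for text-only generation.
    """
    return [
        {"video": None if video is None else str(video), "prompt": prompt}
        for video, prompt in entries
    ]


if __name__ == "__main__":
    effects_only = build_prompt(audio_caption="Rain on a tin roof, distant thunder")
    full_mix = build_prompt(
        words="We should head inside.",
        audio_caption="A calm adult voice, rain in the background",
        music="Slow ambient piano, melancholic, about 70 bpm",
    )
    manifest = build_manifest(
        [("clips/storm.mp4", effects_only), ("clips/porch.mp4", full_mix)]
    )
    print(json.dumps(manifest, indent=2))
```

Omitting a block leaves it out of the prompt entirely, which mirrors the "any combination of the three" behavior the summary describes. The clip paths in the usage section are placeholders.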
Verify against the repo before relying on details.