explaingit

nju-speech/foley-omni

15 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

TLDR

AI research tool from Nanjing University that generates synchronized soundtracks (speech, sound effects, and music together) for silent video clips using a text prompt to describe what you want to hear.

Mindmap

mindmap
  root((repo))
  Audio Generation
    Speech synthesis
    Sound effects
    Background music
    Full soundtrack
  Text Prompt Format
    WORDS block
    AUDIO_CAPTION block
    MUSIC block
  Video Input
    Up to 10 seconds
    Visual feature extraction
    Batch mode JSON
  AI Model Stack
    Foley-Omni checkpoint
    Wan2.2 text encoder
    MMAudio components
  Setup Requirements
    Python 3.10
    CUDA 12.4
    PyTorch 2.6
    FlashAttention
  Output
    MP4 with audio
    arXiv paper
    V2ST-Bench benchmark

Things people build with this

USE CASE 1

Add a realistic soundtrack to a silent video clip by describing the sounds, speech, and music you want in a text prompt.

USE CASE 2

Generate background music and ambient sound effects for short video content without recording real audio.

USE CASE 3

Research and benchmark AI models that automatically sync audio to video for academic or experimental purposes.

USE CASE 4

Produce voiceover speech combined with background music for a video scene using a single AI inference run.

Tech stack

Python, PyTorch, CUDA, FlashAttention, Hugging Face, YAML, JSON, MMAudio

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires Python 3.10, CUDA 12.4, PyTorch 2.6, and FlashAttention. You must also download several large model weights from Hugging Face (the Foley-Omni checkpoint, the Wan2.2 text encoder, and the MMAudio components). A GPU with ample VRAM is strongly recommended.
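Before downloading the weights, it can help to confirm that the local environment matches the versions listed above. The following is a minimal sketch, not part of the repository: the helper names `version_matches` and `check_environment` are hypothetical, and it assumes FlashAttention is installed under its usual import name `flash_attn`.

```python
import importlib.util
import sys

# Versions taken from the requirements listed above.
REQUIRED_PYTHON = (3, 10)
REQUIRED_TORCH = "2.6"
REQUIRED_CUDA = "12.4"


def version_matches(found, required):
    """Return True when `found` begins with the `required` major.minor pair.

    A local build suffix such as "+cu124" is ignored.
    """
    if found is None:
        return False
    wanted = required.split(".")
    return found.split("+")[0].split(".")[: len(wanted)] == wanted


def check_environment():
    """Hypothetical helper: map each requirement to True or False."""
    report = {"python": sys.version_info[:2] == REQUIRED_PYTHON}
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so nothing GPU-related can be checked.
        report.update(torch=False, cuda=False, gpu=False)
    else:
        import torch

        report["torch"] = version_matches(torch.__version__, REQUIRED_TORCH)
        # torch.version.cuda is None on CPU-only builds.
        report["cuda"] = version_matches(torch.version.cuda, REQUIRED_CUDA)
        report["gpu"] = torch.cuda.is_available()
    # Assumption: FlashAttention is importable as `flash_attn`.
    report["flash_attn"] = importlib.util.find_spec("flash_attn") is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:<12}{'ok' if ok else 'missing or mismatched'}")
```

The check only compares version prefixes. It does not verify VRAM size or that the model weights are present.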

No license is stated. All rights are reserved by default, so you should not use, copy, or distribute this code without explicit permission from the authors.

In plain English

Foley-Omni is a Python research project from Nanjing University that generates audio for silent or muted videos using an AI model. Given a video clip and a text description, the model produces a complete soundtrack containing speech, sound effects, and background music together, all synchronized with what is happening on screen. This kind of task, sometimes called video-to-soundtrack generation, is the main focus of the project.

The text prompt fed to the model uses a structured format with three optional blocks. A WORDS block specifies what speech should be spoken. An AUDIO_CAPTION block describes ambient sounds, events, and speaker characteristics. A MUSIC block specifies music style, mood, instruments, and tempo. You can include any combination of the three, so you can generate only sound effects, only music, only speech, or all three at once. The model also supports text-only generation without any video input. The current public checkpoint is designed for videos up to 10 seconds long.

Running inference involves setting up a YAML config file that points to input videos and their prompt data, then running a Python inference script. The output is an MP4 file with the generated audio merged in. A batch mode accepts a JSON manifest listing multiple videos. Visual features can be pre-extracted to speed up repeated inference on the same footage.

Installation requires Python 3.10, CUDA 12.4, PyTorch 2.6, and FlashAttention. Model weights are downloaded from Hugging Face and consist of several components: the Foley-Omni checkpoint itself, a text encoder from the Wan2.2 video model, and pre-trained audio components from MMAudio. The total download is substantial.

This is a research code release accompanying an arXiv paper. A benchmark dataset (V2ST-Bench) and a Hugging Face demo are listed as coming soon. No license is stated in the README.
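To make the three-block prompt and the batch manifest more concrete, here is a rough sketch of how they might be assembled in Python. Everything format-specific is an assumption for illustration: this summary does not give the exact block delimiter, so the `NAME: text` layout is a guess, and the manifest keys `video` and `prompt` and the helper names `build_prompt` and `build_manifest` are hypothetical. Check the repository's README for the real schema before using it.

```python
import json

# Block names come from the description above; the fixed ordering is an assumption.
BLOCK_ORDER = ("WORDS", "AUDIO_CAPTION", "MUSIC")


def build_prompt(words=None, audio_caption=None, music=None):
    """Join whichever blocks were supplied into one prompt string.

    Assumption: each block is written as "NAME: text" on its own line.
    The real delimiter syntax may differ.
    """
    blocks = {"WORDS": words, "AUDIO_CAPTION": audio_caption, "MUSIC": music}
    lines = [f"{name}: {blocks[name].strip()}" for name in BLOCK_ORDER if blocks[name]]
    if not lines:
        # Assumption: a prompt with no blocks at all is not meaningful.
        raise ValueError("provide at least one of words, audio_caption, music")
    return "\n".join(lines)


def build_manifest(items):
    """Turn (video_path, prompt) pairs into a list of manifest entries.

    The keys "video" and "prompt" are hypothetical placeholders.
    """
    return [{"video": str(video), "prompt": prompt} for video, prompt in items]


if __name__ == "__main__":
    park = build_prompt(
        audio_caption="footsteps on gravel, birds chirping",
        music="gentle acoustic guitar, slow tempo",
    )
    cooking = build_prompt(
        words="Now add the garlic to the pan.",
        audio_caption="oil sizzling, a calm male narrator",
        music="upbeat background music",
    )
    manifest = build_manifest([("park.mp4", park), ("cooking.mp4", cooking)])
    print(json.dumps(manifest, indent=2))
```

Leaving out a keyword argument omits that block, which mirrors the "any combination of the three" behavior described above.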

Copy-paste prompts

Prompt 1
I have a 5-second silent video of someone walking in a park. Write me a Foley-Omni prompt using the AUDIO_CAPTION and MUSIC blocks to generate footsteps, birds chirping, and gentle acoustic guitar.
Prompt 2
Explain the three prompt blocks in Foley-Omni (WORDS, AUDIO_CAPTION, MUSIC) with a concrete example for a cooking video that needs narration, sizzling sounds, and upbeat background music.
Prompt 3
Show me how to set up the YAML config file for Foley-Omni to run inference on a single video file and save the output MP4 with merged audio.
Prompt 4
How do I use Foley-Omni in batch mode? Give me an example JSON manifest for processing three different video clips with different prompts.
Prompt 5
What are the minimum hardware requirements to run Foley-Omni inference, and how do I pre-extract visual features to speed up repeated runs on the same video?

Verify against the repo before relying on details.