explaingit

ali-vilab/dreamvideo-omni

14 · Python · Audience · researcher · Complexity · 4/5 · Setup · hard

TLDR

A research tool from Alibaba that generates AI videos with fine-grained control over multiple people and objects, letting you specify reference images, motion paths, bounding boxes, and camera movements all at once.

Mindmap

mindmap
  root((DreamVideo-Omni))
    What it does
      Generates AI videos
      Multi-subject motion
      Identity preservation
    Inputs
      Reference images
      Text captions
      Motion paths
    Tech
      Python
      DiffSynth-Studio
      Wan2.1 base model
    Use Cases
      Research experiments
      Custom motion videos
      Controllable generation

Things people build with this

USE CASE 1

Generate a video of two specific people each following different drawn motion paths

USE CASE 2

Create a video where subjects stay in bounding-box regions you define while the camera pans

USE CASE 3

Experiment with identity-preserving video generation using your own reference face photos

USE CASE 4

Test combined motion control and text guidance in a research setting
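The use cases above all come down to per-frame motion signals. The helpers below are a rough sketch of that idea, not DreamVideo-Omni's own input format (the README defines that). They resample a hand-drawn path to one point per frame, derive a bounding box that follows the path, and express a camera pan as per-frame offsets. The function names `resample_path`, `boxes_along_path`, and `camera_pan`, along with the pixel-coordinate convention, are hypothetical.

```python
import math

Point = tuple[float, float]  # (x, y) in pixels; hypothetical convention, Python 3.9+


def resample_path(path: list[Point], num_frames: int) -> list[Point]:
    """Resample a drawn polyline to num_frames points evenly spaced by arc length."""
    if num_frames < 2 or len(path) < 2:
        raise ValueError("need at least 2 frames and 2 path points")
    # Cumulative arc length at each vertex of the drawn path.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    if total == 0:
        return [path[0]] * num_frames  # degenerate path: subject stays put
    out = []
    seg = 0
    for i in range(num_frames):
        target = total * i / (num_frames - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(path) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = path[seg], path[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def boxes_along_path(points: list[Point], width: float, height: float):
    """Center a fixed-size (x0, y0, x1, y1) box on each per-frame point."""
    return [(x - width / 2, y - height / 2, x + width / 2, y + height / 2) for x, y in points]


def camera_pan(num_frames: int, dx_total: float, dy_total: float = 0.0) -> list[Point]:
    """Linear camera pan as cumulative per-frame (dx, dy) offsets."""
    if num_frames < 2:
        raise ValueError("need at least 2 frames")
    return [(dx_total * i / (num_frames - 1), dy_total * i / (num_frames - 1))
            for i in range(num_frames)]
```

For two subjects on different paths, you would call `resample_path` once per subject with the same `num_frames`, so every subject has exactly one position per generated frame. For example, `resample_path([(0, 0), (10, 0), (10, 10)], 5)` gives `[(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (10.0, 5.0), (10.0, 10.0)]`.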

Tech stack

Python · DiffSynth-Studio · Wan2.1

Getting it running

Difficulty · hard · Time to first run · 1h+

Requires downloading about 2.8 GB of model weights, plus the Wan2.1 base model, which is fetched automatically on first run. You then run a Python script against a custom metadata folder.
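The input folder is the main thing to prepare by hand. The sketch below writes one hypothetical case folder. The file name `metadata.json` and the keys `caption`, `ref_images`, `tracks`, and `camera` are placeholders, not the repo's confirmed schema, so check the README's example cases before adapting it.

```python
import json
import tempfile
from pathlib import Path

# HYPOTHETICAL layout: the real DreamVideo-Omni metadata schema is defined in the
# repo's README and example folders. "metadata.json" and the keys below are
# placeholders for illustration only.
def write_case_folder(folder, caption, ref_images=(), tracks=None, camera=None):
    """Write a metadata.json describing one generation case into `folder`.

    Reference images must already exist inside `folder`. An empty
    `ref_images` models a "motion tracks only" case.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    missing = [name for name in ref_images if not (folder / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing reference images in {folder}: {missing}")
    meta = {"caption": caption, "ref_images": list(ref_images)}
    if tracks is not None:
        meta["tracks"] = tracks  # hypothetical key: per-subject lists of [x, y] points
    if camera is not None:
        meta["camera"] = camera  # hypothetical key, e.g. "pan_left"
    out = folder / "metadata.json"
    out.write_text(json.dumps(meta, indent=2))
    return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("person_a.png", "person_b.png"):
            (Path(tmp) / name).touch()  # empty placeholder files stand in for real photos
        path = write_case_folder(tmp, "Two friends walk along a beach.",
                                 ["person_a.png", "person_b.png"])
        print(json.loads(path.read_text()))
```

Checking that every referenced image exists before writing the file catches the most common setup mistake early, instead of at the end of a long model load.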

License not specified in the README.

In plain English

DreamVideo-Omni is a research project from Alibaba's Tongyi Lab and several partner universities that generates AI videos with fine-grained control over multiple people or objects and how they move. The core problem it addresses is that existing video generation tools struggle when you want to specify both who or what appears in a video and exactly how each subject should move independently of the others.

The system accepts reference images of the subjects you want to appear, a text description, and optional motion cues: paths drawn on frames, bounding boxes that specify where each subject should be, or camera movement instructions. It can handle all three types of motion control at once, hence the name "Omni." To keep each subject recognizable throughout the video, the authors developed a training step that rewards the model when the generated faces and appearances match the references, a technique they call latent identity reinforcement learning.

In practice, generating a video requires downloading the model weights (about 2.8 GB) plus a base model called Wan2.1, which is fetched automatically on first run. You then run a Python script called infer.py and point it at a folder containing your reference images and a metadata file with your caption and motion instructions. The README includes three example cases: one using two reference images with no motion paths, one using motion tracks with no reference images, and one combining both.

The project was published as an academic paper in March 2026, and the inference code and trained weights were released in May 2026. It is built on top of two existing open-source tools, DiffSynth-Studio and Wan2.1. This is a research release aimed at developers and researchers who want to experiment with controllable video generation, not a consumer product with a graphical interface.
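The idea of rewarding identity matches can be illustrated with a generic stand-in. This sketch is not the authors' latent identity reinforcement learning. It assumes you already have identity embedding vectors (for example from some face-recognition model) for each generated frame and for each reference, scores a clip by average cosine similarity, and then normalizes a group of rewards into advantages, as many policy-gradient setups do. The names `identity_reward`, `multi_subject_reward`, and `group_advantages` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def identity_reward(frame_embeddings, ref_embedding):
    """Average similarity of each generated frame's identity embedding to the reference."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    return sum(cosine(f, ref_embedding) for f in frame_embeddings) / len(frame_embeddings)


def multi_subject_reward(subjects):
    """Average the per-subject rewards; `subjects` maps name -> (frame_embeddings, ref)."""
    scores = [identity_reward(frames, ref) for frames, ref in subjects.values()]
    return sum(scores) / len(scores)


def group_advantages(rewards):
    """Normalize rewards within a group of samples to mean 0 and unit std.

    Returns all zeros when every reward is equal (nothing to prefer).
    """
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

Averaging per subject, rather than pooling all frames together, keeps one well-preserved subject from masking another that drifts away from its reference.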

Copy-paste prompts

Prompt 1
I have reference images of two people and want to generate a video where each follows a different motion path. How do I set up the metadata file and run infer.py in dreamvideo-omni?
Prompt 2
Help me write the metadata JSON for dreamvideo-omni to place subjects inside bounding boxes while also applying a panning camera movement.
Prompt 3
I cloned dreamvideo-omni and the base model Wan2.1 should download automatically on first run. What folder structure do I need for reference images and how do I format the caption in the input config?
Prompt 4
Show me how to run the dreamvideo-omni example that combines motion tracks with reference images from the README.

Verify against the repo before relying on details.