Generate a video of two specific people each following different drawn motion paths
Create a video where subjects stay in bounding-box regions you define while the camera pans
Experiment with identity-preserving video generation using your own reference face photos
Test combined motion control and text guidance in a research setting
Requires downloading about 2.8 GB of model weights (the Wan2.1 base model is fetched automatically on first run), then running a Python script against a custom metadata folder.
DreamVideo-Omni is a research project from Alibaba's Tongyi Lab and several partner universities that generates AI videos with fine-grained control over multiple people or objects and how they move. The core challenge it addresses is that existing video generation tools struggle when you want to specify both who or what appears in a video and exactly how each subject should move independently of the others.

The system accepts reference images of the subjects you want to appear, a text description, and optional motion cues: paths drawn on frames, bounding boxes that specify where each subject should be, or camera movement instructions. It can handle all three types of motion control at once, hence the name "Omni." To keep each subject recognizable throughout the video, the authors developed a training step that rewards the model when the generated faces and appearances match the references, using a technique they call latent identity reinforcement learning.

In practice, generating a video requires downloading the model weights (about 2.8 GB) plus a base model called Wan2.1, which is fetched automatically on first run. You then run a Python script called infer.py and point it at a folder containing your reference images and a metadata file with your caption and motion instructions. The README includes three example cases: one using two reference images with no motion paths, one using motion tracks with no reference images, and one combining both.

The project was published as an academic paper in March 2026, and the inference code and trained weights were released in May 2026. It is built on top of two existing open-source tools: DiffSynth-Studio and Wan2.1. This is a research release aimed at developers and researchers who want to experiment with controllable video generation, not a consumer product with a graphical interface.
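As a rough illustration of the folder-plus-metadata workflow described above, the sketch below writes one metadata file per case, mirroring the three README cases (references only, motion tracks only, and both). The file name `metadata.json`, the field names `caption`, `reference_images`, and `trajectories`, and the helper `build_case` are all assumptions made for illustration; the actual schema expected by infer.py may differ, so check the repository's example folders before adapting this.

```python
import json
import tempfile
from pathlib import Path


def build_case(case_dir, caption, reference_images=(), trajectories=None):
    """Write a HYPOTHETICAL metadata.json for one generation case.

    The file name and every field name here are illustrative assumptions,
    not the schema used by the DreamVideo-Omni repository.
    """
    case_dir = Path(case_dir)
    case_dir.mkdir(parents=True, exist_ok=True)

    metadata = {
        "caption": caption,
        # Paths to reference photos, relative to the case folder (assumed).
        "reference_images": list(reference_images),
    }
    if trajectories:
        # One list of normalized (x, y) points per subject (assumed format).
        metadata["trajectories"] = trajectories

    path = case_dir / "metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path


root = Path(tempfile.mkdtemp())

# Case 1: two reference images, no motion paths.
identity_only = build_case(
    root / "identity_only",
    "Two friends talk in a cafe.",
    reference_images=["person_a.png", "person_b.png"],
)

# Case 2: motion tracks, no reference images.
motion_only = build_case(
    root / "motion_only",
    "A dog runs across a field.",
    trajectories={"subject_0": [[0.1, 0.5], [0.5, 0.5], [0.9, 0.4]]},
)

# Case 3: reference images and motion tracks together.
combined = build_case(
    root / "combined",
    "Two friends walk in opposite directions.",
    reference_images=["person_a.png", "person_b.png"],
    trajectories={
        "subject_0": [[0.5, 0.5], [0.2, 0.5]],
        "subject_1": [[0.5, 0.5], [0.8, 0.5]],
    },
)
```

Keeping each case in its own folder makes it easy to point the inference script at one case at a time; how the script is invoked, and which arguments it takes, should be taken from the README rather than from this sketch.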
Verify against the repo before relying on details.