VACE is an AI model for video creation and editing released by Alibaba's Tongyi Lab. It was accepted at ICCV 2025, one of the top computer vision research conferences. The model handles a range of video tasks in a single unified system rather than requiring a separate tool for each operation. It has three core modes: generating a video from a reference image or video clip (R2V), editing an existing video (V2V), and editing a specific region of a video defined by a mask (MV2V). In practice this covers tasks such as moving an object within a scene, replacing one object with another, animating a still image, extending the edges of a video frame outward, or making a character in a video follow the motion of a reference person. These tasks can be combined in a single pipeline.

Several model versions are available. The smaller ones (1.3B parameters) run at resolutions around 480x832 pixels across roughly 80 frames. The largest model (14B parameters) targets 720x1280 resolution. All models are hosted on Hugging Face and ModelScope. They build on two underlying video generation systems: Wan2.1 (from Alibaba) and LTX-Video (from Lightricks). Licenses vary by model: most of the Wan-based models are Apache-2.0, while the LTX-based model carries a separate RAIL-M license.

Setup requires Python 3.10, CUDA 12.4, and PyTorch 2.5.1 or later, so a reasonably recent NVIDIA GPU is needed. After installing dependencies, you download the model weights and run generation from the command line by specifying a task name, a video file, and a text prompt. A Gradio-based graphical interface is also included for interactive use. Preprocessing tools for extracting depth maps, pose information, or flow data are available as an optional separate install. The codebase also includes a benchmark dataset for evaluating the model's output quality across the supported task types.
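The setup requirements mentioned above (Python 3.10, CUDA 12.4, PyTorch 2.5.1 or later) can be checked before downloading any weights. The following is a minimal sketch, not part of the VACE codebase: the helper names parse_version, meets_minimum, and check_environment are hypothetical, and treating the Python and CUDA versions as minimums is an assumption, since the summary only says "or later" for PyTorch.

```python
import re

# Version numbers taken from the setup requirements in the summary.
# Assumption: all three are treated as minimums, not exact pins.
MIN_PYTHON = (3, 10)
MIN_TORCH = (2, 5, 1)
MIN_CUDA = (12, 4)


def parse_version(text):
    """Turn a string such as '2.5.1+cu124' into the tuple (2, 5, 1)."""
    public = text.split("+", 1)[0]  # drop a local suffix like '+cu124'
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:  # stop at non-numeric pieces such as 'dev2024'
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Return True when the version string is at least the minimum tuple."""
    return parse_version(found) >= minimum


def check_environment(python_version, torch_version, cuda_version):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not meets_minimum(python_version, MIN_PYTHON):
        problems.append(f"Python {python_version} is older than 3.10")
    if not meets_minimum(torch_version, MIN_TORCH):
        problems.append(f"PyTorch {torch_version} is older than 2.5.1")
    if cuda_version is None:
        # A CPU-only PyTorch build reports no CUDA version at all.
        problems.append("PyTorch was built without CUDA support")
    elif not meets_minimum(cuda_version, MIN_CUDA):
        problems.append(f"CUDA {cuda_version} is older than 12.4")
    return problems
```

With PyTorch installed, the inputs could come from platform.python_version(), torch.__version__, and torch.version.cuda, for example check_environment(platform.python_version(), torch.__version__, torch.version.cuda). The repository's own installation notes remain the authority on which versions are actually supported.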
Verify against the repo before relying on details.