explaingit

law1223/alignvid

15 · Python · Audience · researcher · Complexity · 4/5 · Setup · hard

TLDR

AlignVid is an ICML 2026 research method that addresses AI video and image models ignoring text prompts. It rebalances internal attention at inference time, with no retraining of the base model.

Mindmap

mindmap
  root((AlignVid))
    Problem solved
      Visual dominance
      Text prompt ignored
    How it works
      Attention Scaling
      Guidance Scheduling
      No retraining needed
    Supported tasks
      Image to video
      Text to video
      Text to image
      Image editing
    Benchmark
      OmitI2V dataset
      367 annotated examples
      Evaluation code

Things people build with this

USE CASE 1

Improve how faithfully FramePack or Wan2.1 follows text prompts when generating video from an image, without any retraining.

USE CASE 2

Evaluate an image-to-video model's text alignment using the OmitI2V benchmark of 367 human-annotated examples.

USE CASE 3

Apply AlignVid to an image-editing pipeline to reduce cases where the model reproduces the input image instead of following the instruction.

Tech stack

Python · PyTorch · Hugging Face

Getting it running

Difficulty · hard · Time to first run · 1 day+

Requires setting up FramePack or Wan2.1 base model first, which typically needs a GPU with sufficient VRAM.

In plain English

AlignVid is a research project, accepted at ICML 2026, that addresses a specific problem with AI models that generate video or images from text instructions. The problem is called visual dominance: when you give these models an image and a text prompt asking for significant changes, the model often ignores the text and just reproduces the original image with minor modifications. AlignVid is a method for fixing that without retraining the model.

The fix works by adjusting how the model distributes its attention internally during the generation process, specifically rebalancing how much weight the text description gets versus the visual input. This happens entirely inside the model at inference time, with no changes to the model's weights and no additional training data. The two mechanisms involved are called Attention Scaling Modulation, which sharpens the attention signal toward the text, and Guidance Scheduling, which controls when and where in the network that sharpening is applied.

The same method works across four types of AI generation tasks: converting an image to video using a text prompt, generating video from text alone, generating images from text, and editing existing images. The authors tested it on several publicly available model families and found it improved how faithfully the outputs matched the text prompt, with less than 0.1 percent added computation time.

The code in this repository integrates AlignVid into two specific model families called FramePack and Wan2.1. Using it requires setting up one of those base models first, then enabling AlignVid through a command-line flag when running generation. The default setting uses a single scaling value, and the authors report it transfers well across models without needing to search for a different value per model.
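
To make the idea of sharpening attention more concrete, here is a minimal, self-contained sketch. It is not the repository's code: it assumes that "scaling" means multiplying attention logits by a single factor before the softmax, and that "scheduling" means applying the factor only in chosen layers and denoising steps. The function names, the toy logits, and the scale value 2.0 are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_attention_weights(logits, scale=1.0):
    """Softmax over logits multiplied by `scale` (assumed form of the scaling).

    A scale above 1 concentrates more weight on the highest-scoring keys.
    """
    return softmax([scale * x for x in logits])

def should_scale(layer_idx, step_idx, layers, steps):
    """Hypothetical schedule: scale only in selected layers and steps."""
    return layer_idx in layers and step_idx in steps

# Toy logits for one query attending over four prompt tokens (made-up values).
logits = [1.0, 0.8, 1.5, 0.2]

baseline = scaled_attention_weights(logits, scale=1.0)
sharpened = scaled_attention_weights(logits, scale=2.0)

# Apply the scale only where the (hypothetical) schedule allows it.
active = should_scale(layer_idx=3, step_idx=0, layers={3, 4}, steps=range(10))
weights = sharpened if active else baseline
```

In this toy setup the sharpened weights still sum to 1, but the strongest token receives a larger share than in the baseline. How the actual method chooses the scale, the layers, and the steps should be checked against the repository.
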
The repository also includes a benchmark dataset called OmitI2V, which contains 367 human-annotated examples covering add, delete, and modify prompts, each with questions for evaluating how well a model followed the instruction. The dataset is hosted on Hugging Face and the evaluation code is included in the repository.
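
As a rough illustration of how question-based scoring over add, delete, and modify prompts could be aggregated, the sketch below computes per-category accuracy. The record layout, the field names (`category`, `expected`, `predicted`), and the yes/no answer format are assumptions made for this example; the actual OmitI2V schema and the repository's evaluation code may differ.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Return {category: fraction of questions whose answer matched}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        for expected, predicted in zip(rec["expected"], rec["predicted"]):
            total[rec["category"]] += 1
            correct[rec["category"]] += int(expected == predicted)
    return {cat: correct[cat] / total[cat] for cat in total}

# Made-up records: each holds the expected answers and a model's answers.
records = [
    {"category": "add", "expected": ["yes", "yes"], "predicted": ["yes", "no"]},
    {"category": "delete", "expected": ["no"], "predicted": ["no"]},
    {"category": "modify", "expected": ["yes"], "predicted": ["no"]},
]

scores = accuracy_by_category(records)
```

Reporting accuracy per category rather than a single number makes it easier to see whether a model struggles more with additions, deletions, or modifications.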

Copy-paste prompts

Prompt 1
I have FramePack set up and want to enable AlignVid. Show me the exact command-line flag to use and explain what the default attention scaling value controls.
Prompt 2
I want to benchmark my video generation model against OmitI2V. Walk me through downloading the dataset from Hugging Face and running the evaluation code in the AlignVid repo.
Prompt 3
Explain the difference between Attention Scaling Modulation and Guidance Scheduling in AlignVid and show me where in the code each mechanism is applied.
Prompt 4
I am getting visual dominance in my Wan2.1 image-to-video outputs where the model ignores my text prompt. Guide me through integrating AlignVid into my existing generation script.

Verify against the repo before relying on details.