Lance is a research release from ByteDance that tries to put several vision tasks into one model: understanding what is in an image or video, generating new images and video from text, and editing existing images and video. The README calls it a unified multimodal model, meaning the same set of weights handles all of these jobs instead of needing a separate model for each.

The headline number is the size. Lance has 3 billion active parameters, which is small for a model that claims to compete on image generation, image editing, and video generation benchmarks at the same time. The authors say the transformer backbone was trained from scratch, with only the vision encoder (ViT) and the autoencoder (VAE) reused from existing work. Training fit within a budget of 128 A100 GPUs, which the paper presents as modest for this kind of multi-task model.

The repository links to a project homepage, an arXiv paper, and a Hugging Face page where the model weights are published. Most of the README is a gallery of demo clips: text prompts turned into short videos, edits applied to existing video clips, multi-turn editing where consistency is kept across rounds, a section labeled intelligent video generation, and video question answering examples where the model picks the right answer from multiple-choice options about a clip.

The part of the README shown does not include concrete install instructions, a training code layout, or hardware requirements for running the model locally. Readers who want to actually use Lance would need to follow the Hugging Face model page or read further into the repo.
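
Because the README excerpt gives no hardware requirements, a rough lower bound on weight memory can be worked out from the 3 billion active parameter figure alone. The sketch below is a back-of-envelope estimate and does not come from the Lance repo: the function name `weight_memory_gib` and the dtype table are illustrative assumptions, and the result ignores the ViT, the VAE, activations, and any parameters that are not counted as active.

```python
# Back-of-envelope estimate only; not taken from the Lance repository.
# Bytes used to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Return the GiB needed just to hold num_params weights in the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # 3e9 is the active-parameter figure quoted in the README summary.
    # Assumption: the total parameter count may be higher than the active count.
    active_params = 3e9
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(active_params, dtype):.1f} GiB")
```

Treat the output as a floor, not a requirement: real memory use while generating or editing video will be noticeably higher once activations and the auxiliary encoders are loaded.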
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.