explaingit

zhengdian1/uni-edit

27 | Python | Audience · researcher | Complexity · 5/5 | Active | License | Setup · hard

TLDR

Research code for a single unified model that handles image understanding, text-to-image generation, and instruction-based editing, trained on one dataset in one stage.

Mindmap

mindmap
  root((Uni-Edit))
    Inputs
      Source images
      Edit instructions
      Text prompts
    Outputs
      Generated images
      Edited images
      Image descriptions
    Use Cases
      Run unified image model
      Train on Uni-Edit-148k
      Reproduce paper results
    Tech Stack
      Python
      PyTorch
      flash-attention
      BAGEL

Things people build with this

USE CASE 1

Run inference for image generation, understanding, or editing from one model

USE CASE 2

Reproduce the paper's results on the Uni-Edit-148k dataset

USE CASE 3

Train a unified vision model on instruction-based editing data

USE CASE 4

Extend the BAGEL or Janus-Pro backbone for your own editing tasks

Tech stack

Python · PyTorch · CUDA · flash-attention · Hugging Face

Getting it running

Difficulty · hard | Time to first run · 1 day+

Needs a GPU, flash-attention, and at least 54 GB of system RAM for the custom step that merges the downloaded checkpoint shards into one safetensors file before inference.
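Since the merge step is memory-hungry, it can help to check the machine before starting. The sketch below is not part of the repository; it assumes a Linux host (it relies on POSIX `sysconf`), treats the quoted 54 GB as binary gigabytes, and uses hypothetical function names. It reports total physical RAM, not free RAM, so it is only a rough pre-flight check.

```python
import os

REQUIRED_GB = 54           # figure quoted in this summary
BYTES_PER_GB = 1024 ** 3   # assumption: interpret "GB" as binary gigabytes


def total_system_memory_bytes() -> int:
    """Total physical RAM in bytes via POSIX sysconf (Linux; not Windows)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def has_enough_memory(total_bytes: int, required_gb: float = REQUIRED_GB) -> bool:
    """Return True if total_bytes meets the required amount of memory."""
    return total_bytes >= required_gb * BYTES_PER_GB
```

Calling `has_enough_memory(total_system_memory_bytes())` before the merge gives an early warning on machines that are clearly too small.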

Licensed under Apache 2.0: free to use, modify, and ship commercially, with a patent grant, as long as the license notice stays with the code.

In plain English

Uni-Edit is the code release for a research paper from a group at the Chinese University of Hong Kong and collaborators. The project is about teaching one AI model to do three related jobs at once: understand images, generate new images from text, and edit existing images according to written instructions. Most existing systems train on a mix of separate datasets for each job and have to juggle conflicting goals across several training stages. The authors argue that intelligent image editing, where the instructions can be complex and contain reasoning, is general enough on its own to cover all three skills. So they train on just one task, with one dataset, in one stage. To make that possible they also built an automated pipeline that turns visual question-answering data into rich editing instructions, producing a dataset called Uni-Edit-148k that pairs each instruction with a high-quality edited image.

The repository contains training scripts, inference scripts, and evaluation scripts. It is set up around an existing open model called BAGEL, with a separate Janus-Pro variant tested in the paper.

The quick start clones the repo, builds a Python 3.10 conda environment, installs the requirements including flash-attention, and downloads the pretrained checkpoint from Hugging Face. Because the checkpoint uses a custom architecture, you cannot load it through the usual Hugging Face shortcut. You first merge the downloaded shards into one safetensors file using a provided script, which needs at least 54 gigabytes of system memory. After that, one command runs inference for generation, understanding, or editing. The code is released under the Apache 2.0 license.
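To make the merge step more concrete, here is a minimal sketch of what combining checkpoint shards into one safetensors file can look like. It is not the repository's script: the function names, the `*.safetensors` glob pattern, and the directory layout are assumptions for illustration. It uses `load_file` and `save_file` from the `safetensors.torch` module, and it holds every tensor in memory at once, which may be one reason such a step needs a lot of RAM.

```python
from pathlib import Path
from typing import Dict, Iterable, Mapping, TypeVar

T = TypeVar("T")


def merge_shards(shards: Iterable[Mapping[str, T]]) -> Dict[str, T]:
    """Combine per-shard name->tensor mappings, rejecting duplicate names."""
    merged: Dict[str, T] = {}
    for index, shard in enumerate(shards):
        duplicates = merged.keys() & shard.keys()
        if duplicates:
            raise ValueError(
                f"shard {index} repeats tensor names: {sorted(duplicates)}"
            )
        merged.update(shard)
    return merged


def merge_safetensors_files(shard_dir: str, output_path: str) -> None:
    """Hypothetical helper: merge every *.safetensors file in shard_dir."""
    # Imported lazily so merge_shards works without safetensors/torch installed.
    from safetensors.torch import load_file, save_file

    # Assumption: shards are the only *.safetensors files in shard_dir,
    # so write output_path somewhere outside that directory.
    shard_paths = sorted(Path(shard_dir).glob("*.safetensors"))
    if not shard_paths:
        raise FileNotFoundError(f"no .safetensors shards in {shard_dir}")

    merged = merge_shards(load_file(str(path)) for path in shard_paths)
    save_file(merged, str(output_path))
```

For the real checkpoint, use the merge script shipped with the repository; the sketch only illustrates the general shape of the operation.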

Copy-paste prompts

Prompt 1
Walk me through setting up the Python 3.10 conda environment with flash-attention for Uni-Edit
Prompt 2
Download the BAGEL checkpoint from Hugging Face and merge the shards into one safetensors file
Prompt 3
Run inference for instruction-based editing on a sample image with a prompt
Prompt 4
Explain how the Uni-Edit-148k dataset was built from visual question-answering data
Prompt 5
Fine-tune Uni-Edit on my own dataset of edit instructions and target images

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.