explaingit

yeates/aurora

27 | Audience · researcher | Complexity · 5/5 | Active | Setup · hard

TLDR

Pre-release research repo for a video editing system that pairs a vision-language agent with a diffusion transformer to handle underspecified edit requests.

Mindmap

mindmap
  root((Aurora))
    Inputs
      User request
      Source video
      Optional reference image
      Optional mask
    Outputs
      Edited clip
      Typed edit plan
      Segmentation mask
      Benchmark scores
    Tasks
      Object replacement
      Object removal
      Style transfer
      Reference insertion
    Tech Stack
      Diffusion transformer
      VLM agent
      Segmentation
      Web image search

Things people build with this

USE CASE 1

Reproduce agent-driven video editing results from the paper once code drops

USE CASE 2

Evaluate a new video editing model on the AgentEdit-Bench benchmark

USE CASE 3

Test object replacement, removal, style transfer, and reference insertion under one model

USE CASE 4

Study how a VLM agent rewrites vague user requests into a typed edit plan

Tech stack

DiT · VLM · Segmentation · Python

Getting it running

Difficulty · hard | Time to first run · 1 day+

Code is not yet released; README promises a late May 2026 drop and gives no install instructions or dependencies.

In plain English

Aurora is a research project that will host the official code for a paper on agent-driven video editing. The README is a placeholder for now: the actual code has not been published yet, with the authors giving an ETA of late May 2026. Links point to an arXiv paper and a project website.

The approach the README describes has two parts working together. The first is a vision-language model agent that reads a raw user request and rewrites it into a typed edit plan with four fields: an instruction, a task label, an image-search query, and a mask phrase. The second is a unified video diffusion transformer that takes that plan and produces the edited clip. The agent talks to outside tools to fill in gaps, for example running a web image search when the user did not supply a reference picture, and running a grounded segmentation model to produce a mask when one is missing.

The editing tasks listed in the README cover four kinds of changes under a single set of model weights. Replacement swaps one object or element for another. Removal deletes an object from the clip. Style transfer changes the visual look of the footage. Reference-driven insertion adds something into the clip based on an example image.

The project also introduces a benchmark called AgentEdit-Bench, which evaluates this style of agent-enhanced video editing under conditions where the user request is underspecified, either in words or in supporting images. That is the situation where a user might say 'put a red car here' without explaining what the car looks like or where exactly to place it.

The README is sparse beyond these points. There is no installation guide, no usage example, no license file mentioned, and no listed dependencies, because the repository is in a pre-release state. Anyone interested in trying the system will need to wait for the planned code drop or read the paper for technical detail.
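Since no code has been released, the typed edit plan and the gap-filling step described above can only be illustrated with a hypothetical sketch. The Python below is not Aurora's implementation: the names `EditPlan`, `EditTask`, `EditInputs`, and `fill_gaps` are invented for illustration, as are the field names and the two tool callables standing in for web image search and grounded segmentation. It assumes the plan carries the four fields the README lists, and that a tool is called only when the user did not supply the matching input.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional


class EditTask(Enum):
    """The four task types listed in the README (labels are hypothetical)."""
    REPLACEMENT = "replacement"
    REMOVAL = "removal"
    STYLE_TRANSFER = "style_transfer"
    REFERENCE_INSERTION = "reference_insertion"


@dataclass(frozen=True)
class EditPlan:
    """Hypothetical typed edit plan with the four fields the README names."""
    instruction: str
    task: EditTask
    image_search_query: Optional[str] = None
    mask_phrase: Optional[str] = None


@dataclass
class EditInputs:
    """A plan plus whatever the user supplied; missing inputs are None."""
    plan: EditPlan
    reference_image: Optional[Any] = None
    mask: Optional[Any] = None


def fill_gaps(
    inputs: EditInputs,
    search_image: Callable[[str], Any],
    segment: Callable[[str], Any],
) -> EditInputs:
    """Call a tool only for inputs the user left out.

    `search_image` and `segment` are placeholders for a web image search
    and a grounded segmentation model; their real interfaces are unknown.
    """
    reference = inputs.reference_image
    if reference is None and inputs.plan.image_search_query:
        reference = search_image(inputs.plan.image_search_query)

    mask = inputs.mask
    if mask is None and inputs.plan.mask_phrase:
        mask = segment(inputs.plan.mask_phrase)

    return EditInputs(plan=inputs.plan, reference_image=reference, mask=mask)


# Usage with stub tools: the user gave neither a reference image nor a mask.
plan = EditPlan(
    instruction="Put a red car on the street",
    task=EditTask.REFERENCE_INSERTION,
    image_search_query="red car side view",
    mask_phrase="street",
)
filled = fill_gaps(
    EditInputs(plan),
    search_image=lambda query: f"image for '{query}'",
    segment=lambda phrase: f"mask for '{phrase}'",
)
```

Passing the tools in as callables keeps the sketch independent of any particular search or segmentation backend, which matters here because the README names neither.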

Copy-paste prompts

Prompt 1
Summarize the Aurora paper's typed edit plan schema with the four fields instruction, task, image search query, and mask phrase
Prompt 2
Compare AgentEdit-Bench from Aurora with existing video editing benchmarks like TGVE and explain what underspecified requests means here
Prompt 3
Sketch a minimal PyTorch wrapper for an agent that calls a grounded segmentation model when the user's mask phrase is missing
Prompt 4
Watch this Aurora repository and alert me when the code drop arrives, then walk me through the first reproducible inference command
Prompt 5
Explain how the Aurora unified diffusion transformer handles object removal versus style transfer under the same weights

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.