yeates/aurora

Analysis updated 2026-06-24

★ 25 · Audience · researcher · Complexity · 5/5 · Setup · hard

Mindmap

mindmap
  root((Aurora))
    Inputs
      User request
      Source video
      Optional reference image
      Optional mask
    Outputs
      Edited clip
      Typed edit plan
      Segmentation mask
      Benchmark scores
    Tasks
      Object replacement
      Object removal
      Style transfer
      Reference insertion
    Tech Stack
      Diffusion transformer
      VLM agent
      Segmentation
      Web image search

What do people build with it?

USE CASE 1

Reproduce agent-driven video editing results from the paper once code drops

USE CASE 2

Evaluate a new video editing model on the AgentEdit-Bench benchmark

USE CASE 3

Test object replacement, removal, style transfer, and reference insertion under one model

USE CASE 4

Study how a VLM agent rewrites vague user requests into a typed edit plan

What is it built with?

DiT · VLM · Segmentation · Python

How does it compare?

	yeates/aurora	andyvandaric/kiroku	appeight/app8-ios-sdk
Stars	25	25	25
Language	—	PowerShell	Swift
Setup difficulty	hard	moderate	moderate
Complexity	5/5	2/5	4/5
Audience	researcher	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard · Time to first run · 1 day+

The code is not yet released. The README promises a late May 2026 drop and gives no install instructions or dependencies.

In plain English

Aurora is a research project that will host the official code for a paper on agent-driven video editing. The README is a placeholder for now: the actual code has not been published yet, with the authors giving an ETA of late May 2026. Links point to an arXiv paper and a project website.

The approach the README describes has two parts working together. The first is a vision-language model agent that reads a raw user request and rewrites it into a typed edit plan with four fields: an instruction, a task label, an image-search query, and a mask phrase. The second is a unified video diffusion transformer that takes that plan and produces the edited clip. The agent talks to outside tools to fill in gaps, for example running a web image search when the user did not supply a reference picture, and running a grounded segmentation model to produce a mask when one is missing.

The editing tasks listed in the README cover four kinds of changes under a single set of model weights. Replacement swaps one object or element for another. Removal deletes an object from the clip. Style transfer changes the visual look of the footage. Reference-driven insertion adds something into the clip based on an example image.

The project also introduces a benchmark called AgentEdit-Bench, which evaluates this style of agent-enhanced video editing under conditions where the user request is underspecified, either in words or in supporting images. That is the situation where a user might say 'put a red car here' without explaining what the car looks like or where exactly to place it.

The README is sparse beyond these points. There is no installation guide, no usage example, no license file mentioned, and no listed dependencies, because the repository is in a pre-release state. Anyone interested in trying the system will need to wait for the planned code drop or read the paper for technical detail.
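Since no code has been released, the typed edit plan and the tool fallbacks described above can only be illustrated with a rough sketch. Everything in the block below is hypothetical: the names `EditPlan`, `Task`, and `fill_gaps`, the task label strings, and the signatures of the search and segmentation callables are assumptions made for illustration, not the authors' API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional, Tuple


class Task(Enum):
    # The four task families listed in the README; the label strings are assumed.
    REPLACEMENT = "replacement"
    REMOVAL = "removal"
    STYLE_TRANSFER = "style_transfer"
    REFERENCE_INSERTION = "reference_insertion"


@dataclass
class EditPlan:
    # The four fields the README names; the types are assumptions.
    instruction: str
    task: Task
    image_search_query: Optional[str] = None
    mask_phrase: Optional[str] = None


def fill_gaps(
    plan: EditPlan,
    video: Any,
    reference_image: Optional[Any],
    mask: Optional[Any],
    search_images: Callable[[str], Any],
    segment: Callable[[Any, str], Any],
) -> Tuple[Optional[Any], Optional[Any]]:
    """Call a tool only for an input the user did not supply.

    `search_images` and `segment` stand in for a web image search and a
    grounded segmentation model; both are hypothetical callables.
    """
    if reference_image is None and plan.image_search_query:
        reference_image = search_images(plan.image_search_query)
    if mask is None and plan.mask_phrase:
        mask = segment(video, plan.mask_phrase)
    return reference_image, mask


# Stub tools so the sketch runs without any model or network access.
def fake_search(query: str) -> str:
    return f"image-for:{query}"


def fake_segment(video: Any, phrase: str) -> str:
    return f"mask-for:{phrase}"


plan = EditPlan(
    instruction="Insert a red car on the road",
    task=Task.REFERENCE_INSERTION,
    image_search_query="red car side view",
    mask_phrase="road",
)
reference, mask = fill_gaps(plan, "clip.mp4", None, None, fake_search, fake_segment)
```

Passing the tools in as callables keeps the gap-filling logic testable without a model. Whether the real agent structures its tool calls this way is unknown until the code is published.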

Copy-paste prompts

Prompt 1

Summarize the Aurora paper's typed edit plan schema with the four fields instruction, task, image search query, and mask phrase

Prompt 2

Compare AgentEdit-Bench from Aurora with existing video editing benchmarks like TGVE and explain what underspecified requests means here

Prompt 3

Sketch a minimal PyTorch wrapper for an agent that calls a grounded segmentation model when the user's mask phrase is missing

Prompt 4

Watch this Aurora repository and alert me when the code drop arrives, then walk me through the first reproducible inference command

Prompt 5

Explain how the Aurora unified diffusion transformer handles object removal versus style transfer under the same weights

Frequently asked questions

What is aurora?

Pre-release research repo for a video editing system that pairs a vision-language agent with a diffusion transformer to handle underspecified edit requests.

How hard is aurora to set up?

Setup difficulty is rated hard, with roughly 1 day+ to a first successful run.

Who is aurora for?

Mainly researchers.

Verify against the repo before relying on details.