zijuncui02/av-phys

Analysis updated 2026-06-24

★ 10 | Python | Audience · researcher | Complexity · 4/5 | Setup · moderate

Mindmap

mindmap
  root((AV-Phys))
    Inputs
      Generated MP4 videos
      Prompt rubrics
      Hugging Face dataset
    Outputs
      Per-rubric JSON verdicts
      Aggregated scores
      Leaderboard entries
    Use Cases
      Score AV generation models
      Compare against baselines
      Reproduce paper results
    Tech Stack
      Python
      Google AI Studio
      Hugging Face

What do people build with it?

USE CASE 1

Score a new audio-video generation model on the AV-Phys Bench rubrics

USE CASE 2

Compare human ratings against MLLM-as-judge scores for AV generation

USE CASE 3

Reproduce paper leaderboard numbers for the seven baseline models

USE CASE 4

Probe an AV model with Anti-AV-Physics prompts that break a physical rule

What is it built with?

Python, Google AI Studio, Hugging Face

How does it compare?

	zijuncui02/av-phys	alsgur9865-sketch/second-brain-engine	compumaxx/gba-video-studio
Stars	10	10	10
Language	Python	Python	Python
Setup difficulty	moderate	moderate	hard
Complexity	4/5	3/5	4/5
Audience	researcher	developer	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 1h+

Needs a Google AI Studio API key plus a Hugging Face dataset download before any evaluator runs.
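
Both prerequisites can be checked before launching an evaluator. The following is a minimal sketch and is not taken from the repository: the environment variable name `GOOGLE_API_KEY` and the local dataset path `data/av-phys` are assumptions, so substitute whatever the README actually specifies.

```python
import os
from pathlib import Path

# Assumption: the variable name below is a placeholder; the repo's README
# may expect a different name for the Google AI Studio key.
API_KEY_ENV = "GOOGLE_API_KEY"


def preflight(dataset_dir, env=None):
    """Return a list of setup problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []

    # The evaluators call Google AI Studio, so a key must be present.
    if not env.get(API_KEY_ENV, "").strip():
        problems.append(f"{API_KEY_ENV} is not set")

    # The Hugging Face dataset must already be downloaded locally.
    root = Path(dataset_dir)
    if not root.is_dir():
        problems.append(f"dataset folder not found: {root}")
    elif not any(root.iterdir()):
        problems.append(f"dataset folder is empty: {root}")

    return problems


if __name__ == "__main__":
    # Hypothetical download location for the dataset.
    for problem in preflight("data/av-phys"):
        print("missing:", problem)
```

Returning a list of problems rather than raising on the first one lets a user fix the key and the download in a single pass.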

In plain English

AV-Phys is the project page and evaluation code for a research benchmark called AV-Phys Bench. The benchmark asks a focused question: do AI models that generate video and audio together actually understand simple physical commonsense, or do they only mimic patterns from their training data? The work comes from researchers at the University of Texas at Dallas, the University of Washington, and the University of California, Los Angeles, and is described in a paper hosted on arXiv.

The benchmark organizes test prompts into three categories of scenes. The first category covers steady situations where the sound source, the action, and the environment all stay the same over time. The second covers event transitions, where a single action changes the physical state of the source, for example a volume knob being turned up. The third covers environment transitions, where the source stays fixed but the path between source and listener changes. Each category also has an Anti-AV-Physics subcategory that deliberately breaks a physical rule, which helps distinguish models that have learned physics from models that have simply memorized physically plausible scenes.

Seven existing audio-video generation systems were tested. Each generated video was scored in three ways: by human raters, by a baseline that uses a multimodal large language model as a judge, and by a custom evaluator called the AV-Phys Agent. A live leaderboard and a per-prompt video gallery sit on the linked project page, and the full dataset of prompts, rubrics, generated videos, and human ratings is hosted on Hugging Face.

The repository itself contains the evaluator code under a code folder, the static site for the project page under docs, and scripts that build the site. To run the evaluators, users install Python requirements, set a Google AI Studio API key, download the dataset from Hugging Face, then run the multimodal model evaluator or the AV-Phys Agent against the generated videos. Each output is a JSON file with per-rubric verdicts and aggregated scores across the categories named in the paper.

The README also explains how to score a new audio-video model: place one MP4 per prompt under a folder named after the model, then point any of the evaluators at that folder. The authors mention that some imperfections may remain and invite outside feedback.
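
The folder convention and the per-rubric output described above can be illustrated with a short sketch. It rests on assumptions rather than on the repository's actual code: the file naming `<prompt_id>.mp4`, the verdict keys `category` and `passed`, and the category labels are all hypothetical, and the real evaluators may aggregate scores differently.

```python
from collections import defaultdict
from pathlib import Path


def find_missing_videos(model_dir, prompt_ids):
    """Return prompt IDs that have no <prompt_id>.mp4 inside model_dir.

    Assumption: videos are named after their prompt ID. The README's
    real naming scheme may differ.
    """
    present = {path.stem for path in Path(model_dir).glob("*.mp4")}
    return sorted(pid for pid in prompt_ids if pid not in present)


def pass_rate_by_category(verdicts):
    """Average boolean rubric verdicts within each category.

    Assumption: each verdict is a dict with hypothetical keys
    'category' (str) and 'passed' (bool).
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, count]
    for verdict in verdicts:
        bucket = totals[verdict["category"]]
        bucket[0] += 1 if verdict["passed"] else 0
        bucket[1] += 1
    return {cat: passed / count for cat, (passed, count) in totals.items()}


# Illustrative data only; these are not real benchmark results.
example = [
    {"category": "steady", "passed": True},
    {"category": "steady", "passed": False},
    {"category": "event_transition", "passed": True},
]
print(pass_rate_by_category(example))
# prints {'steady': 0.5, 'event_transition': 1.0}
```

Running `find_missing_videos` on a model folder before scoring catches gaps early, since a missing MP4 would otherwise surface only after the evaluator has already spent API calls.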

Copy-paste prompts

Prompt 1

Set up the AV-Phys evaluator with a Google AI Studio API key and run it on a sample folder of generated MP4s

Prompt 2

Download the AV-Phys dataset from Hugging Face and reproduce the per-category scores from the paper

Prompt 3

Score my own audio-video model on AV-Phys Bench by dropping one MP4 per prompt into a named folder

Prompt 4

Walk me through the AV-Phys Agent evaluator code and explain how each rubric verdict is computed

Frequently asked questions

What is av-phys?

Project page and evaluator code for AV-Phys Bench, a benchmark that tests whether audio-video generation models grasp basic physical commonsense.

What language is av-phys written in?

Mainly Python. The stack also includes Google AI Studio and Hugging Face.

How hard is av-phys to set up?

Setup difficulty is rated moderate, with roughly 1h+ to a first successful run.

Who is av-phys for?

Mainly researchers.

Verify against the repo before relying on details.