explaingit

zijuncui02/av-phys

11 · Python · Audience · researcher · Complexity · 4/5 · Active · Setup · moderate

TLDR

Project page and evaluator code for AV-Phys Bench, a benchmark that tests whether audio-video generation models grasp basic physical commonsense.

Mindmap

mindmap
  root((AV-Phys))
    Inputs
      Generated MP4 videos
      Prompt rubrics
      Hugging Face dataset
    Outputs
      Per-rubric JSON verdicts
      Aggregated scores
      Leaderboard entries
    Use Cases
      Score AV generation models
      Compare against baselines
      Reproduce paper results
    Tech Stack
      Python
      Google AI Studio
      Hugging Face

Things people build with this

USE CASE 1

Score a new audio-video generation model on the AV-Phys Bench rubrics

USE CASE 2

Compare human ratings against MLLM-as-judge scores for AV generation

USE CASE 3

Reproduce paper leaderboard numbers for the seven baseline models

USE CASE 4

Probe an AV model with Anti-AV-Physics prompts that break a physical rule

Tech stack

Python · Google AI Studio · Hugging Face

Getting it running

Difficulty · moderate · Time to first run · 1h+

Needs a Google AI Studio API key plus a Hugging Face dataset download before any evaluator runs.

In plain English

AV-Phys is the project page and evaluation code for a research benchmark called AV-Phys Bench. The benchmark asks a focused question: do AI models that generate video and audio together actually understand simple physical commonsense, or do they only mimic patterns from their training data? The work comes from researchers at the University of Texas at Dallas, the University of Washington, and the University of California, Los Angeles, and is described in a paper hosted on arXiv.

The benchmark organizes test prompts into three categories of scenes. The first category covers steady situations where the sound source, the action, and the environment all stay the same over time. The second covers event transitions, where a single action changes the physical state of the source, for example a volume knob being turned up. The third covers environment transitions, where the source stays fixed but the path between source and listener changes. Each category also has an Anti-AV-Physics subcategory that deliberately breaks a physical rule, which helps distinguish models that have learned physics from models that have simply memorized physically plausible scenes.

Seven existing audio-video generation systems were tested. Each generated video was scored in three ways: by human raters, by a baseline that uses a multimodal large language model as a judge, and by a custom evaluator called the AV-Phys Agent. A live leaderboard and a per-prompt video gallery sit on the linked project page, and the full dataset of prompts, rubrics, generated videos, and human ratings is hosted on Hugging Face.

The repository itself contains the evaluator code under a code folder, the static site for the project page under docs, and scripts that build the site. To run the evaluators, users install Python requirements, set a Google AI Studio API key, download the dataset from Hugging Face, then run the multimodal model evaluator or the AV-Phys Agent against the generated videos.
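
The setup steps above can be checked before launching an evaluator. The following is a minimal sketch, not the repository's own tooling. It assumes the API key is read from an environment variable named `GOOGLE_API_KEY` and that the dataset was downloaded to `data/av-phys`. Both names, and the `preflight` helper itself, are illustrative and should be checked against the README.

```python
import os
from pathlib import Path

# Assumed names: neither is confirmed by the repository README.
API_KEY_ENV_VAR = "GOOGLE_API_KEY"   # hypothetical environment variable name
DATASET_DIR = Path("data/av-phys")   # hypothetical local download location


def preflight(env=None, dataset_dir=DATASET_DIR, key_var=API_KEY_ENV_VAR):
    """Return a list of readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []

    # The evaluators need an API key before they can call the judge model.
    if not env.get(key_var):
        problems.append(f"environment variable {key_var} is not set")

    # The dataset must be downloaded before any evaluator runs.
    dataset_dir = Path(dataset_dir)
    if not dataset_dir.is_dir():
        problems.append(f"dataset folder {dataset_dir} does not exist")
    elif not any(dataset_dir.iterdir()):
        problems.append(f"dataset folder {dataset_dir} is empty")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("not ready:", problem)
```

Returning a list of problems rather than raising on the first one lets a user fix the key and the dataset download in a single pass.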
Each output is a JSON file with per-rubric verdicts and aggregated scores across the categories named in the paper. The README also explains how to score a new audio-video model: place one MP4 per prompt under a folder named after the model, then point any of the evaluators at that folder. The authors mention that some imperfections may remain and invite outside feedback.
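
The folder convention and the per-rubric output described above can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the `<prompt_id>.mp4` file naming, the `category` and `passed` keys, the category labels, and both helper functions. The real evaluator output may be shaped differently, and a plain pass rate may not match the aggregation used in the paper.

```python
import json
from collections import defaultdict
from pathlib import Path


def missing_videos(model_dir, prompt_ids):
    """Return prompt IDs with no <prompt_id>.mp4 in model_dir (assumed naming)."""
    present = {p.stem for p in Path(model_dir).glob("*.mp4")}
    return sorted(set(prompt_ids) - present)


def pass_rate_by_category(verdicts):
    """Compute the fraction of passed rubrics per category.

    `verdicts` is a list of dicts with hypothetical keys "category" (str)
    and "passed" (bool); this schema is not taken from the repository.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, seen]
    for verdict in verdicts:
        totals[verdict["category"]][0] += bool(verdict["passed"])
        totals[verdict["category"]][1] += 1
    return {cat: passed / seen for cat, (passed, seen) in totals.items()}


if __name__ == "__main__":
    # Hypothetical verdicts using illustrative category labels.
    example = [
        {"category": "steady", "passed": True},
        {"category": "steady", "passed": False},
        {"category": "event_transition", "passed": True},
    ]
    print(json.dumps(pass_rate_by_category(example), indent=2))
```

Running `missing_videos` before an evaluator is a cheap way to confirm that the model folder holds one MP4 per prompt, so a missing file is not later mistaken for a failed rubric.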

Copy-paste prompts

Prompt 1
Set up the AV-Phys evaluator with a Google AI Studio API key and run it on a sample folder of generated MP4s
Prompt 2
Download the AV-Phys dataset from Hugging Face and reproduce the per-category scores from the paper
Prompt 3
Score my own audio-video model on AV-Phys Bench by dropping one MP4 per prompt into a named folder
Prompt 4
Walk me through the AV-Phys Agent evaluator code and explain how each rubric verdict is computed

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.