Score a new audio-video generation model on the AV-Phys Bench rubrics
Compare human ratings against MLLM-as-judge scores for AV generation
Reproduce paper leaderboard numbers for the seven baseline models
Probe an AV model with Anti-AV-Physics prompts that break a physical rule
Needs a Google AI Studio API key plus a Hugging Face dataset download before any evaluator runs.
AV-Phys is the project page and evaluation code for a research benchmark called AV-Phys Bench. The benchmark asks a focused question: do AI models that generate video and audio together actually understand simple physical commonsense, or do they only mimic patterns from their training data? The work comes from researchers at the University of Texas at Dallas, the University of Washington, and the University of California, Los Angeles, and is described in a paper hosted on arXiv.
The benchmark organizes test prompts into three categories of scenes. The first category covers steady situations where the sound source, the action, and the environment all stay the same over time. The second covers event transitions, where a single action changes the physical state of the source, for example a volume knob being turned up. The third covers environment transitions, where the source stays fixed but the path between source and listener changes. Each category also has an Anti-AV-Physics subcategory that deliberately breaks a physical rule, which helps distinguish models that have learned physics from models that have simply memorized physically plausible scenes.
Seven existing audio-video generation systems were tested. Each generated video was scored in three ways: by human raters, by a baseline that uses a multimodal large language model as a judge, and by a custom evaluator called the AV-Phys Agent. A live leaderboard and a per-prompt video gallery sit on the linked project page, and the full dataset of prompts, rubrics, generated videos, and human ratings is hosted on Hugging Face.
The repository itself contains the evaluator code under a code folder, the static site for the project page under docs, and scripts that build the site. To run the evaluators, users install Python requirements, set a Google AI Studio API key, download the dataset from Hugging Face, then run the multimodal model evaluator or the AV-Phys Agent against the generated videos.
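The two prerequisites for an evaluator run (an API key and a local copy of the dataset) can be checked up front. The sketch below is illustrative only and is not taken from the repository: the environment variable name GOOGLE_API_KEY and the directory name data are placeholders, so substitute whatever the README actually specifies.

```python
import os
from pathlib import Path

# Assumption: the key is read from an environment variable. The name below is
# a placeholder; use whichever variable the repository's README specifies.
API_KEY_VAR = "GOOGLE_API_KEY"


def preflight(dataset_dir, env=None, key_var=API_KEY_VAR):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # The key must be present and non-blank.
    if not env.get(key_var, "").strip():
        problems.append(f"environment variable {key_var} is not set")
    # The dataset directory must exist and contain at least one entry.
    root = Path(dataset_dir)
    if not root.is_dir():
        problems.append(f"dataset directory {root} does not exist")
    elif not any(root.iterdir()):
        problems.append(f"dataset directory {root} is empty")
    return problems


if __name__ == "__main__":
    # "data" is a hypothetical download location, not a path from the repo.
    for problem in preflight("data"):
        print("not ready:", problem)
```

Running a check like this first avoids starting a long evaluation only to have it fail on a missing key or an incomplete download.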
Each output is a JSON file with per-rubric verdicts and aggregated scores across the categories named in the paper. The README also explains how to score a new audio-video model: place one MP4 per prompt under a folder named after the model, then point any of the evaluators at that folder. The authors mention that some imperfections may remain and invite outside feedback.
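Before scoring a new model, it can help to confirm that its folder really holds one MP4 per prompt. The sketch below assumes each video is named after its prompt ID (for example p001.mp4). That naming scheme, the function name check_model_folder, and the example IDs are hypothetical and not confirmed by the repository.

```python
from pathlib import Path


def check_model_folder(model_dir, prompt_ids):
    """Compare the MP4 files in model_dir against the expected prompt IDs.

    Assumption: each video is named "<prompt_id>.mp4". The real naming
    scheme is defined by the repository and may differ.
    """
    # File stems (names without the .mp4 suffix) found directly in the folder.
    found = {p.stem for p in Path(model_dir).glob("*.mp4")}
    expected = set(prompt_ids)
    return {
        "missing": sorted(expected - found),      # prompts with no video
        "unexpected": sorted(found - expected),   # videos matching no prompt
    }
```

Calling check_model_folder("my_model", ["p001", "p002"]) returns a dictionary whose missing and unexpected lists are both empty when the folder matches the prompt list exactly.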
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.