yczhou001/pf-opsd

★ 17 · Python · Audience: researcher · Complexity: 5/5 · Setup: hard

Mindmap

mindmap
  root((pf-opsd))
    Research Goal
      Combine world models with LLMs
      Controlled concrete reasoning
      Spatial and physical tasks
    Benchmarks
      VRQABench mazes and Sokoban
      OpenWorldQA physical videos
      Hugging Face downloads
    Training Method
      PF-OPSD approach
      Privileged future frames
      Teacher-student learning
    Setup
      Python and video datasets
      External LLM API key
      Three code modules


Things people build with this

USE CASE 1

Evaluate a vision-language model on spatial reasoning using VRQABench, 4,636 maze and Sokoban puzzle questions with verifiable ground-truth answers.

USE CASE 2

Benchmark a model on predicting physical video outcomes using the 4,404-question OpenWorldQA dataset.

USE CASE 3

Train a model with the PF-OPSD method so it learns to reason about future events even without access to future frames at test time.

USE CASE 4

Use the five-stage AI pipeline to generate your own question-answer dataset from action videos.
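As a rough illustration of how such a five-stage pipeline could be wired together, here is a minimal sketch. It assumes the stages are pre-event frame extraction, question-answer design, distractor generation, filtering of too-easy items with a smaller model, and a final quality review. Every name in it (`QAItem`, `Stages`, `build_items`, and the stage callables) is hypothetical and is not taken from the repository; each callable stands in for a model-backed agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QAItem:
    frame: str                      # identifier of the pre-event frame
    question: str
    answer: str
    distractors: List[str] = field(default_factory=list)


@dataclass
class Stages:
    # Each callable is a hypothetical stand-in for an LLM or vision agent.
    extract_frame: Callable[[str], str]
    design_qa: Callable[[str], List[QAItem]]
    make_distractors: Callable[[QAItem], List[str]]
    small_model_solves: Callable[[QAItem], bool]
    passes_review: Callable[[QAItem], bool]


def build_items(video: str, stages: Stages) -> List[QAItem]:
    """Run one video through the five stages and return the accepted items."""
    frame = stages.extract_frame(video)                    # stage 1: pre-event frame
    kept: List[QAItem] = []
    for item in stages.design_qa(frame):                   # stage 2: question-answer sets
        item.distractors = stages.make_distractors(item)   # stage 3: plausible wrong answers
        if stages.small_model_solves(item):                # stage 4: drop too-easy questions
            continue
        if stages.passes_review(item):                     # stage 5: quality review
            kept.append(item)
    return kept
```

Passing the stages in as callables keeps the control flow testable with simple stubs before any API key or video data is involved.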

Tech stack

Python · Hugging Face

Getting it running

Difficulty · hard

Time to first run · 1 day+

Dataset construction requires specific video datasets and an external language model API key. To evaluate without rebuilding, use the prebuilt Hugging Face datasets.

License terms are not mentioned in the repository description.

In plain English

PF-OPSD is a research project exploring how to combine two types of AI systems: world models, which generate visual predictions of what will happen next in a scene, and multimodal language models, which can reason abstractly about goals, rules, and questions. The authors identify a problem that arises when you simply plug these together: a world model can generate visually plausible future frames that are still wrong for the specific task at hand, and the language model does not automatically know when to trust a simulation or how to weigh it against its own text-based reasoning. The paper, linked on arXiv, calls this challenge controlled concrete reasoning and makes three contributions.

The first is VRQABench, a benchmark dataset of 4,636 questions built from maze navigation and Sokoban puzzle images. Because the correct answers to spatial puzzles can be verified programmatically with a search algorithm, the answers are checked against computed ground truth rather than hand-labeled.

The second is OpenWorldQA, a benchmark of 4,404 questions about predicting physical outcomes from real-world video footage. Questions in this dataset were generated by a five-stage pipeline of AI agents that extracts a pre-event frame from a video, designs plausible question-answer sets, generates misleading but plausible wrong answers, filters out too-easy questions using a smaller model, and accepts only items that pass a quality review.

The third contribution is the PF-OPSD training method itself: during training, ground-truth future video is provided as privileged context that a teacher model can use, and the student model learns to reason as if it had seen those futures, even though it has no access to them at test time.

Running the code requires Python, specific video datasets, and an API key for an external language model to drive the dataset construction pipelines.
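To make the idea of programmatic answer verification concrete, the sketch below checks a claimed shortest-path length on a small text maze. It is an assumption-laden illustration: breadth-first search is one common choice of search algorithm, the grid encoding (`S` start, `G` goal, `#` wall) is invented for this example, and neither the function names nor the format come from the repository.

```python
from collections import deque
from typing import List, Optional


def shortest_path_length(grid: List[str]) -> Optional[int]:
    """Return the minimum number of moves from 'S' to 'G', or None if unreachable.

    Assumes a rectangular grid containing exactly one 'S'.
    """
    rows, cols = len(grid), len(grid[0])
    start = next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "S")
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if grid[r][c] == "G":
            return dist
        # Explore the four orthogonal neighbours that are inside the grid and not walls.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def answer_is_correct(grid: List[str], claimed_length: int) -> bool:
    """Compare a model's claimed path length with the computed ground truth."""
    return shortest_path_length(grid) == claimed_length
```

Because the ground truth is computed rather than annotated, the same check can be applied to every generated question without manual review.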
Prebuilt versions of both benchmark datasets are available on Hugging Face for researchers who want to evaluate models without rebuilding from scratch. This repository is aimed at AI researchers working on vision and reasoning. The code is structured into three independent parts covering dataset construction for each benchmark and the training pipeline for the proposed method.
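The teacher-student idea can be illustrated with a generic privileged-information distillation loss. This is only a sketch under stated assumptions: the summary does not give the actual PF-OPSD objective, so the KL-divergence form below, the use of per-option answer logits, and all function names are hypothetical. The teacher logits are assumed to come from a model that saw the future frames, and the student logits from a model that did not.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same options."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits: List[float], student_logits: List[float]) -> float:
    """Penalize the student for diverging from the teacher's answer distribution.

    teacher_logits: scores computed WITH privileged future frames (assumed).
    student_logits: scores computed WITHOUT them (assumed).
    """
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))
```

The loss is zero when the student already matches the teacher and grows as the student's answer distribution drifts away, which is the behavior a training loop would minimize.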

Copy-paste prompts

Prompt 1

I want to evaluate my vision-language model on VRQABench. Write a Python script that loads the dataset from Hugging Face and scores model answers against the maze/Sokoban verifier.

Prompt 2

Explain the PF-OPSD training method: how does the teacher model use privileged future video frames, and how does the student model learn to reason without them at test time?

Prompt 3

I have a set of action videos. Walk me through the five-stage OpenWorldQA pipeline: frame extraction, question design, distractor generation, filtering, and quality review.

Prompt 4

Write a Python evaluation loop that loads the PF-OPSD checkpoint and runs it on the OpenWorldQA test split, reporting accuracy per question category.


Verify against the repo before relying on details.