aim-uofa/reasonmatch

★ 12PythonAudience · researcherComplexity · 5/5Setup · hard

Mindmap

mindmap
  root((reasonmatch))
    What it does
      Wide-baseline matching
      Spatial reasoning eval
      RL-based training
    Two contributions
      ReasonMatch-Bench dataset
      Dynamic Correspondence RL
    Evaluation
      Hugging Face dataset
      API-based model eval
      Difficulty levels
    Requirements
      Python 3.10
      veRL framework
      Multi-GPU training

mindmap root((reasonmatch)) What it does Wide-baseline matching Spatial reasoning eval RL-based training Two contributions ReasonMatch-Bench dataset Dynamic Correspondence RL Evaluation Hugging Face dataset API-based model eval Difficulty levels Requirements Python 3.10 veRL framework Multi-GPU training

Click or tap to explore — scroll the page freely

Things people build with this

USE CASE 1

Evaluate a multimodal AI vision model's spatial reasoning ability using the ReasonMatch-Bench dataset and compare results at different difficulty levels.

USE CASE 2

Train a vision model to match corresponding points across wide-baseline image pairs using Dynamic Correspondence Reinforcement Learning.

USE CASE 3

Use the benchmark evaluation scripts to test any API-accessible vision model on the wide-baseline matching task without retraining.

USE CASE 4

Study whether a vision-language model truly understands 3D geometry or is just matching visual textures by running it through the graded benchmark.

Tech stack

PythonPyTorchveRLHugging Face

Getting it running

Difficulty · hard Time to first run · 1day+

Training requires the veRL distributed RL framework and multiple GPUs, training data from the paper is not publicly released.

No license information was mentioned in the explanation.

In plain English

ReasonMatch is a research project from Zhejiang University, Ant Group, and Westlake University, published at CVPR 2026. It focuses on testing and improving how well AI vision models can reason about spatial relationships in images, specifically the problem of matching points across two photos of the same scene taken from very different angles. The task the researchers set out to study is called wide-baseline matching. If you take two photos of the same building from opposite sides, figuring out which pixel in photo A corresponds to which pixel in photo B requires understanding geometry, how objects change appearance as viewpoint changes, and how parts of the scene may be hidden in one photo but visible in another. The researchers argue this is a useful test of whether a multimodal AI model (one that can see and reason) genuinely understands space, rather than just pattern-matching on texture. The repository contains two main contributions. The first is ReasonMatch-Bench, a dataset and evaluation suite that grades models on this matching task at different levels of difficulty, varying both how far apart the two camera positions are and how fine-grained the matching needs to be. The second is a training method called Dynamic Correspondence Reinforcement Learning, which teaches models to do this task using reinforcement learning rather than step-by-step chain-of-thought supervision. For people wanting to evaluate models, the dataset is downloadable from Hugging Face or a ModelScope mirror, and evaluation scripts run against any model exposed through a standard API. Training code is included for researchers who have their own data in the required format, though the training data used for the paper is not publicly released. The code depends on a framework called veRL, which handles distributed reinforcement learning training across multiple machines and GPUs. Python 3.10 or newer is required. This is a research codebase intended for AI researchers studying multimodal reasoning.

Copy-paste prompts

Prompt 1

Download the ReasonMatch-Bench dataset from Hugging Face and run the evaluation scripts against GPT-4o to measure its wide-baseline image-matching accuracy at each difficulty level.

Prompt 2

Set up the veRL distributed training environment for ReasonMatch and explain the Dynamic Correspondence Reinforcement Learning training loop, what reward signal is used and how is it computed?

Prompt 3

I have my own wide-baseline image pair dataset in the required format. Show me how to adapt the ReasonMatch training code to fine-tune a multimodal model on my data.

Prompt 4

Explain the difference between training with Dynamic Correspondence RL and chain-of-thought supervision for spatial matching, what does each method teach the model to do differently?

Prompt 5

Run the ReasonMatch-Bench evaluation on a locally hosted vision model exposed through an OpenAI-compatible API and produce a results table broken down by difficulty tier.

Open on GitHub → Explain another repo

← aim-uofa on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.