Evaluate a multimodal AI vision model's spatial reasoning ability using the ReasonMatch-Bench dataset and compare results at different difficulty levels.
Train a vision model to match corresponding points across wide-baseline image pairs using Dynamic Correspondence Reinforcement Learning.
Use the benchmark evaluation scripts to test any API-accessible vision model on the wide-baseline matching task without retraining.
Study whether a vision-language model truly understands 3D geometry or is just matching visual textures by running it through the graded benchmark.
Training requires the veRL distributed RL framework and multiple GPUs; the training data used in the paper is not publicly released.
ReasonMatch is a research project from Zhejiang University, Ant Group, and Westlake University, published at CVPR 2026. It focuses on testing and improving how well AI vision models can reason about spatial relationships in images, specifically the problem of matching points across two photos of the same scene taken from very different angles.

The task the researchers set out to study is called wide-baseline matching. If you take two photos of the same building from opposite sides, figuring out which pixel in photo A corresponds to which pixel in photo B requires understanding geometry, how objects change appearance as viewpoint changes, and how parts of the scene may be hidden in one photo but visible in another. The researchers argue this is a useful test of whether a multimodal AI model (one that can see and reason) genuinely understands space, rather than just pattern-matching on texture.

The repository contains two main contributions. The first is ReasonMatch-Bench, a dataset and evaluation suite that grades models on this matching task at different levels of difficulty, varying both how far apart the two camera positions are and how fine-grained the matching needs to be. The second is a training method called Dynamic Correspondence Reinforcement Learning, which teaches models to do this task using reinforcement learning rather than step-by-step chain-of-thought supervision.

For people wanting to evaluate models, the dataset is downloadable from Hugging Face or a ModelScope mirror, and evaluation scripts run against any model exposed through a standard API. Training code is included for researchers who have their own data in the required format, though the training data used for the paper is not publicly released. The code depends on a framework called veRL, which handles distributed reinforcement learning training across multiple machines and GPUs. Python 3.10 or newer is required.
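To make the idea of grading by difficulty level concrete, here is a minimal Python sketch of how per-level scores might be computed for point-matching predictions. It is an illustration only: the function name `accuracy_by_level`, the sample fields (`level`, `pred`, `gt`), and the 10-pixel threshold are all assumptions, and the metric and data format actually used by ReasonMatch-Bench may differ.

```python
from collections import defaultdict
from math import hypot


def accuracy_by_level(samples, threshold_px=10.0):
    """Return {level: fraction of predictions within threshold_px of the ground truth}.

    Hypothetical scoring rule: a predicted point counts as correct when its
    Euclidean distance to the ground-truth point is at most threshold_px.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample in samples:
        (pred_x, pred_y), (gt_x, gt_y) = sample["pred"], sample["gt"]
        totals[sample["level"]] += 1
        if hypot(pred_x - gt_x, pred_y - gt_y) <= threshold_px:
            hits[sample["level"]] += 1
    return {level: hits[level] / totals[level] for level in totals}


# Made-up predictions, grouped by an assumed difficulty label.
samples = [
    {"level": "easy", "pred": (100, 100), "gt": (103, 104)},  # 5 px away
    {"level": "easy", "pred": (50, 50), "gt": (80, 90)},      # 50 px away
    {"level": "hard", "pred": (10, 10), "gt": (10, 12)},      # 2 px away
]
result = accuracy_by_level(samples)
print(result)
```

Reporting one score per level, rather than a single average, is what lets a reader see whether a model's accuracy falls off as the camera baseline widens or the matching becomes more fine-grained.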
This is a research codebase intended for AI researchers studying multimodal reasoning.
Verify against the repo before relying on details.