Study a two-stage teacher-student training pipeline for improving visual spatial reasoning in AI models
Evaluate a vision-language model's ability to navigate FrozenLake, maze, and MiniBehaviour tasks from images alone
Use the perception SFT pipeline as a template for training a model to describe visual task states before planning
Requires three separate Python environments and pretrained checkpoints that are not yet publicly released, making full replication currently impossible.
MGSD is a research project exploring how vision-language models, which are AI systems that can process both text and images, can learn to plan through visually presented spatial tasks. The core challenge the researchers address is a gap between what a model can understand when given text descriptions of a situation versus when it has to interpret the same situation from an image. The work is described in an academic paper published on arXiv.

The training process has two stages. In the first stage, called cold-start perception SFT, the model is trained to recognize and describe the state of a task from an image before it is asked to make any planning decisions. This is meant to give the model a grounded understanding of what it is looking at. In the second stage, called OPCD training, a text-only version of the model acts as a teacher that sees symbolic descriptions of a task, while a visual version of the model acts as a student that sees images of the same task. The student learns by comparing its reasoning to the teacher's.

The code supports three tasks. FrozenLake is a grid-based navigation challenge where the model must reach a goal while avoiding holes. Maze asks the model to figure out which corridors are open and plan a path through them. MiniBehaviour involves picking up a specific object (a printer) and placing it next to another object (a table). All three tasks are visual: the model receives an image rather than a symbolic description of the environment.

Practically, running this code requires setting up three separate Python environments because the training and evaluation pipelines rely on different dependencies. The repository organizes these into a perception SFT pipeline, a reinforcement-learning-style OPCD training pipeline, and an evaluation toolkit. The actual training data and pretrained checkpoints are noted as not yet released, so replicating the results from scratch is not possible at the time of this writing.
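To make the FrozenLake task described above concrete, here is a minimal sketch of how a proposed plan could be scored against a symbolic grid. It is not taken from the repository: the grid layout, the `U`/`D`/`L`/`R` action encoding, the deterministic (non-slippery) movement, and the names `GRID`, `find_cell`, and `evaluate_plan` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical 4x4 layout: S = start, F = frozen (safe), H = hole, G = goal.
GRID: List[str] = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

# Assumed action encoding: one letter per move, as (row delta, column delta).
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def find_cell(grid: List[str], symbol: str) -> Tuple[int, int]:
    """Return the (row, column) of the first cell containing `symbol`."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return r, c
    raise ValueError(f"symbol {symbol!r} not found in grid")


def evaluate_plan(grid: List[str], plan: str) -> str:
    """Simulate `plan` from the start cell and report how the episode ends.

    Assumes deterministic movement; a move that would leave the grid
    keeps the agent in its current cell.
    """
    r, c = find_cell(grid, "S")
    for action in plan:
        dr, dc = MOVES[action]
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
            r, c = nr, nc
        cell = grid[r][c]
        if cell == "H":
            return "fell_in_hole"
        if cell == "G":
            return "reached_goal"
    return "incomplete"


if __name__ == "__main__":
    for plan in ("DDRRDR", "RD", "RR"):
        print(plan, evaluate_plan(GRID, plan))
```

In the setting the project describes, the student model would have to produce such a plan from an image of the grid, while the teacher would see a symbolic form comparable to `GRID`. A checker like this only scores the final plan and says nothing about how either model is trained.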
This repository is aimed at researchers working on multimodal AI and spatial reasoning. Non-technical users would not have a direct use for the code, but the project goal, teaching an AI to look at a picture of a maze and figure out how to walk through it, is broadly approachable as a concept.
Verify against the repo before relying on details.