explaingit

oranger-l/mgsd

17 Python | Audience · researcher | Complexity · 5/5 | Setup · hard

TLDR

A research codebase for training AI models to plan through visual spatial tasks like mazes by learning from both image and text descriptions of the same environment using a two-stage teacher-student method.

Mindmap

mindmap
  root((mgsd))
    Training stages
      Cold-start perception
      OPCD teacher-student
    Tasks
      FrozenLake grid
      Maze navigation
      MiniBehaviour pickup
    Model types
      Vision-language model
      Text-only teacher
      Visual student
    Setup
      3 Python environments
      Perception SFT pipeline
      OPCD training pipeline
      Evaluation toolkit
    Status
      Training data unreleased
      Checkpoints unreleased

Things people build with this

USE CASE 1

Study a two-stage teacher-student training pipeline for improving visual spatial reasoning in AI models

USE CASE 2

Evaluate a vision-language model's ability to navigate FrozenLake, maze, and MiniBehaviour tasks from images alone

USE CASE 3

Use the perception SFT pipeline as a template for training a model to describe visual task states before planning
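As a rough illustration of use case 2, evaluating a planner on a FrozenLake-style task comes down to replaying the predicted actions on the grid and checking whether the goal is reached before a hole is entered. The sketch below is not taken from the mgsd evaluation toolkit: the grid encoding (`S`, `F`, `H`, `G`), the action letters, and the function names `plan_succeeds` and `success_rate` are all assumptions made for illustration.

```python
# Hypothetical grid encoding (an assumption, not the repo's format):
# S = start, F = frozen (safe) tile, H = hole, G = goal.
GRID = ["SFF",
        "FHF",
        "FFG"]

# Hypothetical action letters mapped to (row, column) offsets.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def plan_succeeds(grid, plan):
    """Return True if following `plan` from S reaches G without entering a hole."""
    rows, cols = len(grid), len(grid[0])
    # Locate the start tile.
    r, c = next((i, j) for i, row in enumerate(grid)
                for j, ch in enumerate(row) if ch == "S")
    for action in plan:
        dr, dc = MOVES[action]
        nr, nc = r + dr, c + dc
        # Assumption: a move off the grid leaves the agent where it is.
        if 0 <= nr < rows and 0 <= nc < cols:
            r, c = nr, nc
        if grid[r][c] == "H":
            return False  # fell into a hole
        if grid[r][c] == "G":
            return True   # reached the goal
    return False  # plan ended before reaching the goal


def success_rate(episodes):
    """Fraction of (grid, plan) pairs whose plan reaches the goal."""
    results = [plan_succeeds(grid, plan) for grid, plan in episodes]
    return sum(results) / len(results)
```

In a real evaluation the plan would be parsed from the model's output given only an image of the grid; here `success_rate([(GRID, "RRDD"), (GRID, "DR")])` scores one plan that reaches the goal and one that steps into the hole.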

Tech stack

Python

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires three separate Python environments and pretrained checkpoints that are not yet publicly released, making full replication currently impossible.

No license information was provided in the explanation.

In plain English

MGSD is a research project exploring how vision-language models, which are AI systems that can process both text and images, can learn to plan through visually presented spatial tasks. The core challenge the researchers address is a gap between what a model can understand when given text descriptions of a situation versus when it has to interpret the same situation from an image. The work is described in an academic paper published on arXiv.

The training process has two stages. In the first stage, called cold-start perception SFT, the model is trained to recognize and describe the state of a task from an image before it is asked to make any planning decisions. This is meant to give the model a grounded understanding of what it is looking at. In the second stage, called OPCD training, a text-only version of the model acts as a teacher that sees symbolic descriptions of a task, while a visual version of the model acts as a student that sees images of the same task. The student learns by comparing its reasoning to the teacher's.

The code supports three tasks. FrozenLake is a grid-based navigation challenge where the model must reach a goal while avoiding holes. Maze asks the model to figure out which corridors are open and plan a path through them. MiniBehaviour involves picking up a specific object (a printer) and placing it next to another object (a table). All three tasks are visual: the model receives an image rather than a symbolic description of the environment.

Practically, running this code requires setting up three separate Python environments because the training and evaluation pipelines rely on different dependencies. The repository organizes these into a perception SFT pipeline, a reinforcement-learning-style OPCD training pipeline, and an evaluation toolkit. The actual training data and pretrained checkpoints are noted as not yet released, so replicating the results from scratch is not possible at the time of this writing.
This repository is aimed at researchers working on multimodal AI and spatial reasoning. Non-technical users would not have a direct use for the code, but the project goal, teaching an AI to look at a picture of a maze and figure out how to walk through it, is broadly approachable as a concept.
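To make the teacher-student idea concrete, here is a minimal sketch of one common way a student can be trained to match a teacher: minimizing the KL divergence between their next-token distributions. This summary does not say what objective OPCD actually uses, so the loss, its direction (teacher relative to student), and the function names below are all assumptions chosen for illustration, not the repository's method.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student).

    teacher_logits: per-token score lists from the text-only teacher
                    (which saw the symbolic description).
    student_logits: per-token score lists from the visual student
                    (which saw the image of the same task).
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)
```

The loss is zero when the student's distributions match the teacher's and grows as they diverge, which is the sense in which the student "compares its reasoning" to the teacher's in this sketch.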

Copy-paste prompts

Prompt 1
I am a researcher reading the MGSD paper. Explain the OPCD training stage in simple terms: what does the text-only teacher do, what does the visual student learn from it, and why does this help?
Prompt 2
I want to set up the three Python environments for mgsd on a fresh Ubuntu machine. List each environment, what it is for, and the key dependencies I need to install in each.
Prompt 3
The mgsd evaluation toolkit tests FrozenLake, Maze, and MiniBehaviour. Write a script that runs the evaluation on all three tasks and outputs a summary table of success rates.
Prompt 4
I want to adapt the mgsd cold-start perception SFT pipeline to a new visual task I designed. What files do I need to modify and what format should my new task images and labels be in?
Prompt 5
The mgsd pretrained checkpoints are not yet released. While I wait, what publicly available vision-language model checkpoints would be the best starting point for the perception SFT stage?

Verify against the repo before relying on details.