explaingit

om-ai-lab/vlm-r1

5,956 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

TLDR

Research framework that trains vision-language AI models using reinforcement learning instead of supervised fine-tuning, achieving better out-of-domain generalization on visual object detection and understanding tasks.

Mindmap

mindmap
  root((vlm-r1))
    What it does
      RL training for VLMs
      Better generalization
      Visual task learning
    Tasks
      Object detection
      Referring expression
      Math reasoning
    Training
      GRPO algorithm
      LoRA fine-tuning
      Multi-node support
    Audience
      AI researchers
      ML practitioners

Things people build with this

USE CASE 1

Fine-tune a vision-language model to locate objects in images from text descriptions using GRPO reinforcement learning.

USE CASE 2

Reproduce DeepSeek-R1-style RL training on a multimodal model using Qwen2.5-VL or InternVL as the base.

USE CASE 3

Train a model to detect objects from categories never seen during training using the open-vocabulary detection setup.

USE CASE 4

Run multi-node distributed training of a vision-language model across multiple machines using the provided configuration.
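For the object-location use case (USE CASE 1), reinforcement learning needs a scalar reward for each predicted box. The summary above does not say which reward VLM-R1 uses, so the following is only a minimal sketch that assumes an intersection-over-union (IoU) reward with a pass/fail threshold. The names `iou` and `grounding_reward`, the `(x1, y1, x2, y2)` box convention, and the 0.5 threshold are hypothetical and are not taken from the repository.

```python
from typing import Sequence

# Assumed box convention for this sketch: (x1, y1, x2, y2) with x1 <= x2, y1 <= y2.
Box = Sequence[float]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, box_a[2] - box_a[0]) * max(0.0, box_a[3] - box_a[1])
    area_b = max(0.0, box_b[2] - box_b[0]) * max(0.0, box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Degenerate boxes have zero union; treat that as no overlap.
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Box, target: Box, threshold: float = 0.5) -> float:
    """Hypothetical binary reward: 1.0 if the prediction overlaps enough, else 0.0."""
    return 1.0 if iou(pred, target) >= threshold else 0.0
```

A thresholded reward is easy to verify automatically, which is the property rule-based RL setups generally rely on; returning the raw IoU instead would give a denser signal.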

Tech stack

Python, PyTorch, LoRA, HuggingFace, GRPO

Getting it running

Difficulty · hard | Time to first run · 1 day+

Full fine-tuning requires multi-GPU CUDA hardware; LoRA mode reduces memory needs but still requires a GPU and a HuggingFace account for model downloads.

In plain English

VLM-R1 is a research project that applies reinforcement learning training techniques to vision-language models. Vision-language models are AI systems that can analyze images and respond to natural language instructions about them. The "R1" name refers to a training style inspired by DeepSeek-R1, a language model that showed strong improvements from reinforcement learning over standard supervised training.

The central finding of this project is that reinforcement learning produces better generalization than supervised fine-tuning (SFT). When models are tested on data outside their training set, SFT models begin to perform worse as training continues, while the RL-trained models keep improving. This difference in out-of-domain performance is the core motivation for the work.

The project applies this training approach to two visual tasks. The first is Referring Expression Comprehension, where the model locates a specific object in an image based on a natural language description. The second is Open-Vocabulary Detection, where the model detects objects from categories not seen during training. The VLM-R1 math reasoning model also reached first place on a public multimodal math leaderboard for models under 4 billion parameters.

Training uses the GRPO algorithm (Group Relative Policy Optimization). The codebase supports full fine-tuning, LoRA fine-tuning (a lighter-weight training approach), multi-node training across multiple machines, and inputs containing multiple images. It works with Qwen2.5-VL and InternVL base models, with documentation for adding new architectures.

Pre-trained model checkpoints and datasets are available on HuggingFace, along with interactive demos. A technical report is published on arXiv. The project also includes inference support for Huawei Ascend hardware.
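The "group relative" part of GRPO can be illustrated with a small sketch. In the commonly described formulation, several responses are sampled for the same prompt, each is scored with a reward, and each response's advantage is its reward normalized by the mean and standard deviation of its own group, so no separate value model is needed. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for this illustration; this is not code from the VLM-R1 repository, and implementations differ in these details.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same prompt.

    advantage_i = (reward_i - mean(rewards)) / (std(rewards) + eps)
    """
    mu = mean(rewards)
    # Population standard deviation is an assumption of this sketch.
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, two of which earned the reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average receive positive advantages and are reinforced; those below average are discouraged. When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.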

Copy-paste prompts

Prompt 1
I want to fine-tune Qwen2.5-VL on a custom referring expression dataset using VLM-R1's GRPO training. Walk me through the dataset format, config file settings, and the launch command.
Prompt 2
Show me the difference between SFT and GRPO training in VLM-R1. Which config file switches the training mode, and what key hyperparameters change between the two?
Prompt 3
How do I use LoRA with VLM-R1 to fine-tune a 7B vision-language model on a single GPU with limited VRAM? Show me the relevant config options to reduce memory usage.
Prompt 4
I want to add a new base model architecture to VLM-R1. Where do I register it and what interface does it need to implement to work with the GRPO trainer?

Verify against the repo before relying on details.