om-ai-lab/vlm-r1

★ 5,956 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((vlm-r1))
    What it does
      RL training for VLMs
      Better generalization
      Visual task learning
    Tasks
      Object detection
      Referring expression
      Math reasoning
    Training
      GRPO algorithm
      LoRA fine-tuning
      Multi-node support
    Audience
      AI researchers
      ML practitioners

Things people build with this

USE CASE 1

Fine-tune a vision-language model to locate objects in images from text descriptions using GRPO reinforcement learning.

USE CASE 2

Reproduce DeepSeek-R1-style RL training on a multimodal model using Qwen2.5-VL or InternVL as the base.

USE CASE 3

Train a model to detect objects from categories never seen during training using the open-vocabulary detection setup.

USE CASE 4

Run multi-node distributed training of a vision-language model across multiple machines using the provided configuration.

Tech stack

Python · PyTorch · LoRA · HuggingFace · GRPO

Getting it running

Difficulty · hard | Time to first run · 1 day+

Full fine-tuning requires multi-GPU CUDA hardware. LoRA mode reduces memory needs but still requires a GPU, plus a HuggingFace account for model downloads.
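A rough way to see why LoRA lowers the memory bill: for one weight matrix, LoRA trains two small low-rank factors instead of the full matrix, so far fewer parameters need gradients and optimizer state. The sketch below is plain arithmetic, not code from the repository. The hidden size of 4096 and rank of 16 are illustrative assumptions, and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters trained when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters trained by LoRA for the same matrix.

    LoRA adds two factors, B (d_out x rank) and A (rank x d_in),
    and leaves the original weight frozen.
    """
    return rank * (d_out + d_in)


# Illustrative sizes only (assumptions, not taken from VLM-R1 configs).
hidden = 4096
rank = 16

full = full_params(hidden, hidden)
lora = lora_trainable_params(hidden, hidden, rank)
ratio = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"fraction trained: {ratio:.4%}")
```

The frozen base weights still have to fit in GPU memory, which is why LoRA mode reduces the requirement rather than removing it.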

In plain English

VLM-R1 is a research project that applies reinforcement learning training techniques to vision-language models. Vision-language models are AI systems that can analyze images and respond to natural language instructions about them. The "R1" name refers to a training style inspired by DeepSeek-R1, a language model that showed strong improvements from reinforcement learning over standard supervised training.

The central finding of this project is that reinforcement learning produces better generalization than supervised fine-tuning (SFT). When models are tested on data outside their training set, SFT models begin to perform worse as training continues, while the RL-trained models keep improving. This difference in out-of-domain performance is the core motivation for the work.

The project applies this training approach to two visual tasks. The first is Referring Expression Comprehension, where the model locates a specific object in an image based on a natural language description. The second is Open-Vocabulary Detection, where the model detects objects from categories not seen during training. The VLM-R1 math reasoning model also reached first place on a public multimodal math leaderboard for models under 4 billion parameters.

Training uses the GRPO algorithm (Group Relative Policy Optimization). The codebase supports full fine-tuning, LoRA fine-tuning (a lighter-weight training approach), multi-node training across multiple machines, and inputs containing multiple images. It works with Qwen2.5-VL and InternVL base models, with documentation for adding new architectures.

Pre-trained model checkpoints and datasets are available on HuggingFace, along with interactive demos. A technical report is published on arXiv. The project also includes inference support for Huawei Ascend hardware.
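To make the GRPO idea mentioned above more concrete, here is a minimal sketch of its "group relative" step applied to a referring-expression example. It is not the repository's implementation. It assumes the reward for a predicted box is its intersection-over-union (IoU) with the ground-truth box, and that each sampled answer's advantage is its reward normalized by the mean and standard deviation of its own group. The function names, boxes, and `eps` value are hypothetical.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in the same group.

    Samples that beat the group average get a positive advantage,
    samples below it get a negative one. No learned value model is used.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, one ground-truth box, and a group of sampled predictions
# (all values are made up for illustration).
ground_truth = (10, 10, 50, 50)
candidates = [
    (10, 10, 50, 50),      # exact match
    (30, 10, 70, 50),      # shifted right, partial overlap
    (100, 100, 120, 120),  # no overlap
    (10, 10, 30, 50),      # covers the left half only
]

rewards = [iou(box, ground_truth) for box in candidates]
advantages = group_relative_advantages(rewards)

for box, r, a in zip(candidates, rewards, advantages):
    print(f"box={box} reward={r:.3f} advantage={a:+.3f}")
```

In a full trainer these advantages would weight the policy-gradient update for each sampled response. That part, along with any format or KL terms, is left out here.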

Copy-paste prompts

Prompt 1

I want to fine-tune Qwen2.5-VL on a custom referring expression dataset using VLM-R1's GRPO training. Walk me through the dataset format, config file settings, and the launch command.

Prompt 2

Show me the difference between SFT and GRPO training in VLM-R1. Which config file switches the training mode, and which key hyperparameters change between the two?

Prompt 3

How do I use LoRA with VLM-R1 to fine-tune a 7B vision-language model on a single GPU with limited VRAM? Show me the relevant config options to reduce memory usage.

Prompt 4

I want to add a new base model architecture to VLM-R1. Where do I register it and what interface does it need to implement to work with the GRPO trainer?

Verify against the repo before relying on details.