Fine-tune a vision-language model to locate objects in images from text descriptions using GRPO reinforcement learning.
Reproduce DeepSeek-R1-style RL training on a multimodal model using Qwen2.5-VL or InternVL as the base.
Train a model to detect objects from categories never seen during training using the open-vocabulary detection setup.
Run multi-node distributed training of a vision-language model across multiple machines using the provided configuration.
Full fine-tuning requires multi-GPU CUDA hardware. LoRA mode reduces memory needs but still requires a GPU and a HuggingFace account for model downloads.
VLM-R1 is a research project that applies reinforcement learning training techniques to vision-language models. Vision-language models are AI systems that can analyze images and respond to natural language instructions about them. The "R1" name refers to a training style inspired by DeepSeek-R1, a language model that showed strong improvements from reinforcement learning over standard supervised training.
The central finding of this project is that reinforcement learning produces better generalization than supervised fine-tuning (SFT). When models are tested on data outside their training set, SFT models begin to perform worse as training continues, while the RL-trained models keep improving. This difference in out-of-domain performance is the core motivation for the work.
The project applies this training approach to two visual tasks. The first is Referring Expression Comprehension, where the model locates a specific object in an image based on a natural language description. The second is Open-Vocabulary Detection, where the model detects objects from categories not seen during training. The VLM-R1 math reasoning model also reached first place on a public multimodal math leaderboard for models under 4 billion parameters.
Training uses the GRPO algorithm (Group Relative Policy Optimization). The codebase supports full fine-tuning, LoRA fine-tuning (a lighter-weight training approach), multi-node training across multiple machines, and inputs containing multiple images. It works with Qwen2.5-VL and InternVL base models, with documentation for adding new architectures.
Pre-trained model checkpoints and datasets are available on HuggingFace, along with interactive demos. A technical report is published on arXiv. The project also includes inference support for Huawei Ascend hardware.
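To make the GRPO idea concrete for the Referring Expression Comprehension task, here is a minimal sketch of the group-relative step: several candidate boxes are sampled for one prompt, each is scored, and each score is normalized against the rest of its group. The IoU-based reward, the function names, and the example boxes are assumptions chosen for illustration; they are not taken from the VLM-R1 codebase, whose actual reward functions may differ. The sketch also uses the population standard deviation, which real implementations may not.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical data: one ground-truth box and four sampled predictions
# for the same image and description.
ground_truth = (10, 10, 50, 50)
sampled_boxes = [
    (10, 10, 50, 50),      # exact match
    (20, 20, 60, 60),      # partial overlap
    (100, 100, 120, 120),  # no overlap
    (10, 10, 30, 50),      # left half of the target
]

rewards = [iou(box, ground_truth) for box in sampled_boxes]
advantages = group_relative_advantages(rewards)

for box, r, a in zip(sampled_boxes, rewards, advantages):
    print(f"box={box} reward={r:.3f} advantage={a:+.3f}")
```

Because advantages are computed relative to the group, no separate value model is needed: predictions that beat the group average get a positive signal and the rest get a negative one. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.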
Verify against the repo before relying on details.