Analysis updated 2026-06-24
Run demo.py on an image with a natural-language query to get masks for matching objects
Benchmark SetCon on grefcoco, muse, or refcoco using the image evaluator script
Evaluate SetCon on video benchmarks alongside a SAM 3 checkpoint
Fine-tune SetCon on a custom dataset using the distributed training script on 8 or more A100s
| | rookiexiong7/setcon | 1lystore/awaek | actashui/sjtu-ppt-template-skill |
|---|---|---|---|
| Stars | 13 | 13 | 13 |
| Language | Python | Python | Python |
| Setup difficulty | hard | moderate | moderate |
| Complexity | 5/5 | 2/5 | 2/5 |
| Audience | researcher | vibe coder | researcher |
Figures from each repo's GitHub metadata at analysis time.
Training needs Qwen3-VL-8B and SAM 3 weights plus at least 8 A100 GPUs, so the full pipeline is not a laptop job.
SetCon is the official code release for the research paper "SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction". Referring segmentation is the task of drawing a precise mask around the object or objects in an image or video that match a text description. "Open-ended" means the description can ask for multiple targets at once and is not limited to a fixed vocabulary.
The README's Highlights section describes the technical idea in three points. First, SetCon reframes the problem as set-level concept prediction, instead of treating each target as an independent special-token output the way some earlier methods do. Second, it connects the reasoning of a large vision-language model with a mask decoder through a language-grounded concept interface, organized into a global set-level concept that defines the overall target scope and finer sub-concepts that map to subsets of the targets. Third, the same interface works for both still images and videos, producing a full set of masks for an image and temporally propagated masks across video frames.
The project ships as a Python codebase that needs Python 3.11. The README explains how to install dependencies with the uv package manager (uv sync --extra latest), download a pretrained SetCon-8B checkpoint from Hugging Face, and put it in a saved_models directory. A demo.py script runs the model on a single image with a natural-language query, for example one asking which curtains in a room someone would have to deal with in order to pull them all down. For benchmarking, evaluation scripts live under projects/setcon/evaluation, with separate entry points for image and video. The image evaluator selects between three benchmark families (grefcoco, muse, refcoco), and the video evaluator additionally needs a SAM 3 checkpoint.
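The two-level concept interface described above (one global set-level concept plus finer sub-concepts that each map to a subset of the targets) can be pictured as a small data structure. The sketch below is only an illustration of that idea: the class names, fields, example phrases, and integer target ids are all hypothetical and are not taken from the SetCon codebase, whose actual implementation may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class SubConcept:
    """Hypothetical: a finer concept that maps to a subset of the targets."""
    phrase: str
    target_ids: frozenset[int]


@dataclass
class SetConcept:
    """Hypothetical: a global concept that defines the overall target scope."""
    phrase: str
    sub_concepts: list[SubConcept] = field(default_factory=list)

    def all_target_ids(self) -> set[int]:
        """Return the union of the targets covered by every sub-concept."""
        covered: set[int] = set()
        for sub in self.sub_concepts:
            covered |= sub.target_ids
        return covered


# Made-up example loosely based on the curtain query: three target objects,
# grouped under two sub-concepts of one set-level concept.
curtains = SetConcept(
    phrase="curtains in the room",
    sub_concepts=[
        SubConcept("curtains on the left window", frozenset({0, 1})),
        SubConcept("curtains on the right window", frozenset({2})),
    ],
)
print(sorted(curtains.all_target_ids()))  # prints [0, 1, 2]
```

The point of the grouping is that a single query yields one scope with several subsets, rather than one independent output per target.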
Training is supported too: you download Qwen3-VL-8B-Instruct and SAM 3 as starting weights, plus the SetCon training annotations on Hugging Face, then run a distributed training script that the authors suggest using on at least 8 A100 GPUs. The repository is licensed under Apache 2.0 and credits the SAM 3 and Sa2VA projects for the components it builds on. A BibTeX entry is provided for citing the paper (arXiv 2605.20110).
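Before a first run, the two basic requirements mentioned above (Python 3.11 and a checkpoint placed in saved_models) can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of the repository: the function name check_setup is made up, and it assumes saved_models sits directly under the repository root, which should be confirmed against the README.

```python
import sys
from pathlib import Path


def check_setup(repo_root, python_version=None):
    """Return a list of problems found; an empty list means the checks passed.

    python_version can be passed as a (major, minor) pair to make the check
    testable; by default the running interpreter's version is used.
    """
    version = tuple(python_version or sys.version_info[:2])
    problems = []
    if version != (3, 11):
        problems.append(f"Python 3.11 expected, found {version[0]}.{version[1]}")
    # Assumption: the checkpoint directory lives at <repo_root>/saved_models.
    models_dir = Path(repo_root) / "saved_models"
    if not models_dir.is_dir():
        problems.append(f"missing checkpoint directory: {models_dir}")
    elif not any(models_dir.iterdir()):
        problems.append(f"checkpoint directory is empty: {models_dir}")
    return problems


if __name__ == "__main__":
    for problem in check_setup("."):
        print("WARNING:", problem)
```

Run from the repository root, the script prints one warning per failed check and nothing if both pass. It does not verify GPU availability or the contents of the checkpoint.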
Official code for SetCon, a referring segmentation model that takes a text description and produces masks for all matching objects in images or video frames.
Mainly Python. The stack also includes uv and Qwen3-VL.
Apache 2.0 license, so you can use, modify, and ship it commercially as long as you keep the notice and state your changes.
Setup difficulty is rated hard, with roughly a day or more to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.