SetCon is the official code release for the research paper "SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction". Referring segmentation takes a text description and draws a precise mask around the matching object or objects in an image or video. Open-ended means the description can ask for multiple targets at once and is not limited to a fixed vocabulary.

The README's Highlights section describes the technical idea in three points. First, SetCon reframes the problem as set-level concept prediction, instead of treating each target as an independent special-token output the way some earlier methods do. Second, it connects the reasoning of a large vision-language model to a mask decoder through a language-grounded concept interface. That interface is organized into a global set-level concept, which defines the overall target scope, and finer sub-concepts, which map to subsets of the targets. Third, the same interface works for both still images and videos, producing a full set of masks for an image and temporally propagated masks across video frames.

The project ships as a Python codebase that needs Python 3.11. The README explains how to install dependencies with the uv package manager (uv sync --extra latest), download a pretrained SetCon-8B checkpoint from Hugging Face, and put it in a saved_models directory. A demo.py script runs the model on a single image with a natural-language query, such as one asking which curtains in a room would need to be handled to pull them all down.

For benchmarking, evaluation scripts live under projects/setcon/evaluation, with separate entry points for image and video. The image evaluator selects among three benchmark families (grefcoco, muse, refcoco), and the video evaluator additionally needs a SAM 3 checkpoint.
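To make the set-level concept idea from the Highlights more concrete, the sketch below models one global concept with finer sub-concepts as plain Python data. It is an illustration only: the class names SetConcept and SubConcept, the target_ids field, and the example phrases are all hypothetical and are not taken from the SetCon codebase, which may represent concepts quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class SubConcept:
    """Hypothetical: a finer phrase that covers a subset of the targets."""
    phrase: str
    target_ids: list[int] = field(default_factory=list)


@dataclass
class SetConcept:
    """Hypothetical: the global phrase that fixes the overall target scope."""
    phrase: str
    sub_concepts: list[SubConcept] = field(default_factory=list)

    def all_target_ids(self) -> list[int]:
        # Union of the targets named by every sub-concept, deduplicated and sorted.
        return sorted({t for sub in self.sub_concepts for t in sub.target_ids})


# Made-up example in the spirit of the README's curtain query.
concept = SetConcept(
    phrase="curtains in the room",
    sub_concepts=[
        SubConcept("curtains on the left window", [0, 1]),
        SubConcept("curtain on the right window", [2]),
    ],
)
print(concept.all_target_ids())
```

The point of the structure is that the set of masks is defined once at the global level, while each sub-concept only has to account for its own share of the targets.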
Training is supported too: you download Qwen3-VL-8B-Instruct and SAM 3 as starting weights, plus the SetCon training annotations from Hugging Face, then run a distributed training script that the authors recommend running on at least 8 A100 GPUs.

The repository is licensed under Apache 2.0 and credits the SAM 3 and Sa2VA projects for the components it builds on. A BibTeX entry is provided for citing the paper (arXiv 2605.20110).
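Before launching the demo, evaluation, or training, it can help to confirm that the prerequisites the README lists are in place. The following is a minimal sketch of such a check, not something the repository provides: the helper names python_ok and missing_paths are hypothetical, reading "needs Python 3.11" as exactly 3.11 is an assumption, and only the saved_models directory name comes from the README. Any further weight directories (for Qwen3-VL-8B-Instruct or SAM 3) would need to be added with whatever paths you actually use.

```python
import sys
from pathlib import Path


def python_ok(version: tuple[int, ...]) -> bool:
    """Assumption: the README's 'Python 3.11' means exactly the 3.11 series."""
    return tuple(version[:2]) == (3, 11)


def missing_paths(root: Path, required: list[str]) -> list[str]:
    """Return the entries of `required` that do not exist under `root`."""
    return [rel for rel in required if not (root / rel).exists()]


# 'saved_models' is the directory named in the README; extend this list
# with your own locations for the other weights.
REQUIRED = ["saved_models"]

if not python_ok(tuple(sys.version_info)):
    print(f"Python 3.11 expected, found {sys.version_info.major}.{sys.version_info.minor}")
for rel in missing_paths(Path.cwd(), REQUIRED):
    print(f"missing: {rel}")
```

Run from the repository root, the script prints nothing when the interpreter version and the listed directories look right.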
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.