idea-research/grounded-segment-anything

★ 17,572 | Jupyter Notebook | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((Grounded SAM))
    What it does
      Text-prompt detection
      Pixel masking
      Image editing
    Components
      Grounding DINO
      Segment Anything
      Stable Diffusion
      RAM tagger
    Use Cases
      Data labeling
      Region editing
      Auto tagging
    Demos
      Hugging Face
      Colab
      Replicate


Things people build with this

USE CASE 1

Automatically label objects in images for a training dataset by describing what to find in plain text instead of drawing boxes manually.

USE CASE 2

Edit a specific region of a photo, like replacing a background, using a text description to select it and Stable Diffusion to change it.

USE CASE 3

Generate descriptive tags for a library of images automatically using the RAM tagging model in the pipeline.

USE CASE 4

Track specific objects across video frames using the follow-on Grounded SAM 2 project.
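
For the data-labeling use case, the detector's boxes usually end up in an annotation file. The following is a minimal sketch, assuming detections arrive as pixel-space (x_min, y_min, x_max, y_max) boxes with a text label and that the target is COCO-style annotations, which store a box as [x, y, width, height]. The `Detection` class, the `to_coco_annotations` helper, and the `min_score` threshold are hypothetical names and values for illustration, not part of the repository.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container, not a repo class)."""
    label: str                                   # phrase from the text prompt
    box_xyxy: Tuple[float, float, float, float]  # pixel x_min, y_min, x_max, y_max
    score: float                                 # detector confidence in [0, 1]


def to_coco_annotations(detections: List[Detection], image_id: int,
                        category_ids: Dict[str, int],
                        min_score: float = 0.35) -> List[dict]:
    """Convert detections to COCO-style annotation dicts.

    COCO stores a box as [x, y, width, height], with (x, y) the top-left
    corner. min_score is an illustrative threshold, not a recommended value.
    """
    annotations = []
    for det in detections:
        # Skip weak detections and labels we have no category id for.
        if det.score < min_score or det.label not in category_ids:
            continue
        x0, y0, x1, y1 = det.box_xyxy
        width, height = x1 - x0, y1 - y0
        annotations.append({
            "id": len(annotations) + 1,
            "image_id": image_id,
            "category_id": category_ids[det.label],
            "bbox": [x0, y0, width, height],
            # Box area; with a pixel mask you would count mask pixels instead.
            "area": width * height,
            "iscrowd": 0,
        })
    return annotations


detections = [
    Detection("car", (10.0, 20.0, 110.0, 80.0), 0.82),
    Detection("car", (200.0, 40.0, 260.0, 90.0), 0.20),  # below threshold
]
annotations = to_coco_annotations(detections, image_id=1,
                                  category_ids={"car": 1})
print(annotations)
```

Filtering by score before writing annotations is a design choice: open-vocabulary detectors can return low-confidence matches for a phrase, and a human review pass over the kept boxes is still advisable.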

Tech stack

Python, PyTorch, Jupyter Notebook, Stable Diffusion, Hugging Face

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires a GPU and several large model downloads. The easiest way to start is the hosted Hugging Face Spaces or Colab demo.

In plain English

Grounded-Segment-Anything is a project from IDEA Research that wires several open-source vision models into one pipeline. The idea is to point at any object in an image just by typing a phrase such as "the red bag" and have the system find that object, draw a tight outline around it, and optionally edit or describe it.

It does this by chaining specialist models. Grounding DINO is an open-vocabulary object detector: you give it a text prompt and it returns a box around whatever you described, without being retrained for each new category. Segment Anything (SAM) takes those boxes and produces a precise pixel-level mask, the actual outline of the object. The pipeline can then hand that mask to Stable Diffusion to edit the region, or bring in Recognize Anything (RAM) to generate descriptive tags automatically. The README is explicit that all parts are independent: any piece can be used on its own or replaced with a similar model, for example a different detector or a different image generator.

You would reach for it when you need to find and outline objects in images from a text description rather than from labeled training data. The README highlights uses such as automatic data labeling, open-vocabulary detection and segmentation, image editing, and data generation. There is also a follow-on project, Grounded SAM 2, for tracking objects across video.

The repository is primarily Jupyter Notebooks, so you can run and inspect each stage interactively, and the project lists hosted demos on Hugging Face Spaces, Colab, Replicate, and ModelScope. This summary is based on an excerpt of the README; the full README is longer.
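
The detect-then-segment chaining, and the point that each stage can be swapped, can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's code: the detector and segmenter here are stubs that take only the image size (real models take the pixels), every function name is hypothetical, and the detector is assumed to return normalized center-format boxes (cx, cy, w, h) that must be converted to pixel corner coordinates before being used as box prompts.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]
Mask = List[List[int]]  # rows of 0/1 values, one entry per pixel
Size = Tuple[int, int]  # (width, height)

# Any callable with these shapes can be plugged in (hypothetical interfaces).
Detector = Callable[[Size, str], List[Box]]   # -> normalized (cx, cy, w, h)
Segmenter = Callable[[Size, Box], Mask]       # pixel xyxy box -> binary mask


def cxcywh_to_xyxy(box: Box, width: int, height: int) -> Box:
    """Convert a normalized center-format box to pixel corner format."""
    cx, cy, w, h = box
    return ((cx - w / 2) * width, (cy - h / 2) * height,
            (cx + w / 2) * width, (cy + h / 2) * height)


def stub_detector(size: Size, prompt: str) -> List[Box]:
    """Stand-in for an open-vocabulary detector: returns one fixed box."""
    return [(0.5, 0.5, 0.5, 0.5)]


def stub_segmenter(size: Size, box: Box) -> Mask:
    """Stand-in for a promptable segmenter: fills the box as the 'mask'."""
    width, height = size
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
            for y in range(height)]


def run_pipeline(size: Size, prompt: str, detect: Detector,
                 segment: Segmenter) -> List[Tuple[Box, Mask]]:
    """Text prompt -> boxes -> one mask per box."""
    width, height = size
    results = []
    for norm_box in detect(size, prompt):
        pixel_box = cxcywh_to_xyxy(norm_box, width, height)
        results.append((pixel_box, segment(size, pixel_box)))
    return results


results = run_pipeline((8, 4), "the red bag", stub_detector, stub_segmenter)
box, mask = results[0]
print(box)
for row in mask:
    print("".join(str(v) for v in row))
```

Because `run_pipeline` depends only on the two callable shapes, replacing either stub with a wrapper around a real model leaves the chaining logic unchanged, which mirrors the independence of parts described above.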

Copy-paste prompts

Prompt 1

Use Grounded-Segment-Anything to find all the cars in an image using a text prompt and draw outlines around them. Show me the Python code end to end.

Prompt 2

I want to auto-label a dataset of product photos by typing the product name. Walk me through using Grounded-SAM to produce bounding boxes and pixel masks.

Prompt 3

Use Stable Diffusion with Grounded-SAM to replace the background of a person in a photo while keeping the person untouched. Show me the full pipeline.

Prompt 4

Run the Grounded-SAM demo on Hugging Face Spaces to segment objects in my own image without installing anything locally. What are the steps?
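
Prompt 3 hinges on mask polarity: segmenting "person" yields a mask of the person, but the region to regenerate is everything else, so the mask has to be inverted before editing. The sketch below shows that step with tiny hand-written grids instead of real images. It assumes the common inpainting convention that 1 marks pixels to repaint and 0 marks pixels to keep; `invert_mask` and `composite` are hypothetical helper names, and the "generated" grid stands in for whatever an image generator would return.

```python
def invert_mask(mask):
    """Flip a binary mask: object pixels become 0, everything else 1."""
    return [[1 - value for value in row] for row in mask]


def composite(original, generated, edit_mask):
    """Take generated pixels where edit_mask is 1, original pixels elsewhere."""
    return [
        [gen if m else orig for orig, gen, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, edit_mask)
    ]


# Mask as a segmenter might return it for "person": 1 = person pixel.
person_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

# Strings stand in for pixel values so the result is easy to read.
original = [
    ["bg", "P", "P", "bg"],
    ["bg", "P", "P", "bg"],
]
generated = [["new"] * 4 for _ in range(2)]  # stand-in for generator output

edit_mask = invert_mask(person_mask)          # 1 = background, to be repainted
result = composite(original, generated, edit_mask)

for row in result:
    print(row)
```

Compositing the original person pixels back over the generated image is one way to guarantee the subject stays untouched even if the generator alters pixels outside the edit mask.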


Verify against the repo before relying on details.