explaingit

idea-research/groundingdino

10,102 | Python | Audience · researcher | Complexity · 3/5 | Setup · moderate

TLDR

An AI model, published at ECCV 2024, that finds and locates objects in images from text descriptions you write, instead of being limited to a fixed list of pre-trained categories.

Mindmap

mindmap
  root((GroundingDINO))
    What it does
      Text-guided detection
      Open-set recognition
      Zero-shot benchmarks
    Tech Stack
      Python
      PyTorch
      Hugging Face
    Use Cases
      Auto image annotation
      Video object tracking
      Image editing
    Integrations
      SAM segmentation
      Stable Diffusion
      Grounded SAM 2

Things people build with this

USE CASE 1

Automatically detect and locate any object in an image by typing a text description, without training a custom model.

USE CASE 2

Combine with Segment Anything Model to draw precise outlines around objects you describe in plain language.

USE CASE 3

Build automated dataset annotation pipelines that label images using text prompts instead of manual bounding boxes.

USE CASE 4

Create image editing workflows that find specific regions by text description and apply modifications to just those areas.
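Use case 3 largely comes down to turning detections into annotation records. The sketch below assumes detections arrive as pixel-space (x_min, y_min, x_max, y_max) boxes with a text label and a score, and converts them to COCO-style annotation dictionaries, where `bbox` is [x, y, width, height]. The `Detection` type, the function name, and the `category_ids` mapping are hypothetical names chosen for illustration; none of them come from the Grounding DINO repository.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one detection (not a Grounding DINO type)."""
    label: str
    score: float
    box_xyxy: tuple  # (x_min, y_min, x_max, y_max) in pixels


def to_coco_annotations(detections, image_id, category_ids, start_id=1):
    """Convert detections for one image into COCO-style annotation dicts."""
    annotations = []
    next_id = start_id
    for det in detections:
        if det.label not in category_ids:
            continue  # skip phrases that do not map to a known category
        x_min, y_min, x_max, y_max = det.box_xyxy
        width, height = x_max - x_min, y_max - y_min
        annotations.append({
            "id": next_id,
            "image_id": image_id,
            "category_id": category_ids[det.label],
            "bbox": [x_min, y_min, width, height],  # top-left x, y, then w, h
            "area": width * height,
            "iscrowd": 0,
            # Not a ground-truth COCO field; kept so a reviewer can filter.
            "score": det.score,
        })
        next_id += 1
    return annotations
```

Keeping the score alongside each record makes it easier to route low-confidence boxes to a human reviewer before the labels are treated as ground truth.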

Tech stack

Python | PyTorch | Hugging Face Transformers

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires PyTorch, with a GPU strongly recommended. Pretrained model weights must be downloaded separately.

In plain English

Grounding DINO is an AI model that finds and locates objects in images based on text descriptions you provide. Traditional object detection models can only identify things from a fixed list of categories they were trained on, such as car, person, or chair. Grounding DINO takes a different approach: you describe what you want to find in plain language, and the model searches the image for it. This is called open-set detection, because the set of things it can detect is open-ended rather than fixed at training time.

The model combines two existing ideas. DINO here is a Transformer-based object detector from the DETR family, not the self-supervised image representation method that shares the name. The project pairs that detector with a language understanding component so that text descriptions and image regions can be matched against each other. The result is a model that scores well on standard object detection benchmarks even without being trained on those datasets: the README reports 52.5 AP on the COCO benchmark with zero COCO training data, and 63.0 AP after fine-tuning on COCO.

The research was published at ECCV 2024 and is implemented in Python using PyTorch. Pretrained model weights are available for download, and the model can be loaded through Hugging Face's transformers library. A live demo runs on Hugging Face Spaces.

The repository also documents how Grounding DINO can be combined with other models. Pairing it with Segment Anything Model (SAM) lets you not just find objects but also draw precise outlines around them, and Grounded SAM 2 extends this to tracking objects across video frames. Pairing it with Stable Diffusion opens up uses in image editing, where you first locate a region with text and then modify it. A newer, more capable version called Grounding DINO 1.5 is available separately through an API.

The project is widely used in computer vision research pipelines, in automated dataset annotation, and for building custom detection systems without collecting training data.
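To make the Hugging Face route concrete, here is a minimal sketch rather than the project's official usage. It assumes the `transformers`, `torch`, and `Pillow` packages are installed and that the checkpoint name `IDEA-Research/grounding-dino-tiny` is available on the Hub. The helper names `build_prompt` and `detect` are hypothetical. The lowercase, period-separated prompt format reflects my understanding of the usage notes for this model. Threshold argument names appear to differ between transformers releases, so the sketch leaves the post-processing thresholds at their defaults.

```python
def build_prompt(labels):
    """Join labels into one lowercase, period-separated prompt string."""
    cleaned = [label.strip().lower() for label in labels if label.strip()]
    if not cleaned:
        return ""
    return " . ".join(cleaned) + " ."


def detect(image_path, labels, model_id="IDEA-Research/grounding-dino-tiny"):
    """Run text-prompted detection on one image (checkpoint name is assumed)."""
    # Heavy dependencies are imported here so build_prompt works without them.
    import torch
    from PIL import Image
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=build_prompt(labels), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # target_sizes expects (height, width); PIL's image.size is (width, height).
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
    )
    return results[0]  # one dict per image, holding scores, labels, and boxes
```

Calling `detect("photo.jpg", ["cat", "red chair"])` would download the weights on first use and run on CPU unless the model and inputs are moved to a GPU. Check the model documentation for your installed transformers version before relying on the exact result keys.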

Copy-paste prompts

Prompt 1
Use Grounding DINO to detect all the cats and red chairs in this photo. Show me the Python code to draw bounding boxes with confidence scores.
Prompt 2
How do I combine Grounding DINO with Segment Anything Model to automatically segment objects I describe in text across video frames?
Prompt 3
Write a Python script that takes a folder of product images and a list of text labels, runs Grounding DINO, and outputs bounding boxes in COCO JSON format.
Prompt 4
Set up Grounding DINO via Hugging Face transformers and run inference on a custom image with a custom text prompt. Show me the minimal code.
Prompt 5
How do I fine-tune Grounding DINO on my own labeled dataset to improve detection accuracy for specific domain objects?
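When checking the code an assistant returns for Prompt 1, one detail worth verifying is the box format. The sketch below assumes the detector returns boxes as normalized (cx, cy, w, h) values in the range 0 to 1, which is my understanding of the original repository's inference helper. Other wrappers may already return pixel corner coordinates, in which case the conversion is unnecessary. Both function names and the default threshold are hypothetical.

```python
def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel corner coordinates."""
    cx, cy, w, h = box
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height
    return (x_min, y_min, x_max, y_max)


def keep_confident(boxes, scores, phrases, box_threshold=0.35):
    """Keep only detections whose score is at least box_threshold."""
    return [
        (box, score, phrase)
        for box, score, phrase in zip(boxes, scores, phrases)
        if score >= box_threshold
    ]
```

Filtering first and converting second keeps the drawing code simple: each surviving tuple carries the box, its confidence score, and the matched phrase to print next to it.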

Verify against the repo before relying on details.