apple/ml-ferret

★ 8,687 | Python | Audience · researcher | Complexity · 5/5 | License | Setup · hard

Mindmap

mindmap
  root((Ferret))
    What it does
      Visual grounding
      Region-based QA
      Spatial references
    Components
      Ferret model
      GRIT dataset
      Ferret-Bench eval
      Ferret-UI variant
    Requirements
      Multiple 80GB GPUs
      Local server setup
      Research license
    Use Cases
      Spatial AI research
      UI understanding
      Grounding evaluation


Things people build with this

USE CASE 1

Research fine-grained visual grounding where a model answers questions about a specific drawn region

USE CASE 2

Evaluate how well a vision model handles spatial references using Ferret-Bench

USE CASE 3

Train or fine-tune a grounding model using the GRIT dataset of 1.1 million region-text examples

USE CASE 4

Study AI understanding of UI screenshots and interface elements with Ferret-UI
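
For use case 2, one generic way to score a predicted box against a reference box is intersection-over-union (IoU). The sketch below is illustrative only: how Ferret-Bench itself scores answers is not described here, and the `box_iou` helper and the `(x_min, y_min, x_max, y_max)` pixel format are assumptions, not part of the Ferret codebase.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (hypothetical helper)."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 10.0, 10.0)
    reference = (5.0, 5.0, 15.0, 15.0)
    print(f"IoU = {box_iou(predicted, reference):.3f}")
```

A threshold on this value (0.5 is a common choice in detection work) can turn the score into a hit-or-miss decision, but the right threshold depends on the benchmark being used.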

Tech stack

Python | PyTorch

Getting it running

Difficulty · hard | Time to first run · 1 day+

Training was done on 8 GPUs with 80 GB of VRAM each. Even the demo needs a compatible GPU and several server processes running at once.

Released for non-commercial research use only. It cannot be used in commercial products or services.

In plain English

Ferret is a research project from Apple that explores a specific capability in AI vision models: pointing at a region of an image, asking questions about it, and receiving answers that point back to specific locations in the image. Most AI image models can describe a whole image or answer general questions about it. Ferret is designed to work with fine-grained references, such as a drawn box, a dot, or a freehand scribble, and to respond by identifying where specific things are in the image.

The project includes three components. The Ferret model is the core research model: it accepts image regions as input and produces responses that refer to image locations. GRIT is a dataset of about 1.1 million examples used to train the model on this kind of grounding and referring task. Ferret-Bench is an evaluation dataset for testing how well models handle the combination of visual reasoning, knowledge, and spatial grounding.

A follow-on version called Ferret-UI applies the same ideas to user interface screenshots, so the model can understand and reason about buttons, menus, and other UI elements in a screen image.

Running Ferret requires significant GPU resources. Training was done on 8 GPUs with 80 GB of memory each. To run the interactive demo, you need to download the model weights, set up several server processes locally, and have a compatible GPU available.

The code and data are released for research use only under non-commercial licenses. The paper was published at ICLR 2024 as a spotlight. This is an academic research release, not a finished product.
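
The three kinds of region reference mentioned above (a drawn box, a dot, a freehand scribble) can all be pictured as lists of pixel coordinates. The sketch below shows one way to hold them in a single structure and reduce any of them to a normalized bounding box. The `Region` class, its field names, and the `normalize` helper are hypothetical and only illustrate the idea; they are not Ferret's actual input format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]                 # (x, y) in pixels
Box = Tuple[float, float, float, float]     # (x_min, y_min, x_max, y_max)


@dataclass
class Region:
    """Hypothetical container for a user-drawn region reference."""
    kind: str            # "box", "point", or "scribble"
    points: List[Point]  # box: two corners; point: one; scribble: many

    def bounding_box(self) -> Box:
        # The tightest axis-aligned box covering every recorded point.
        xs = [x for x, _ in self.points]
        ys = [y for _, y in self.points]
        return (min(xs), min(ys), max(xs), max(ys))


def normalize(box: Box, width: int, height: int) -> Box:
    """Scale pixel coordinates to the 0..1 range for a given image size."""
    x0, y0, x1, y1 = box
    return (x0 / width, y0 / height, x1 / width, y1 / height)


if __name__ == "__main__":
    scribble = Region("scribble", [(10, 20), (30, 5), (25, 40)])
    box = scribble.bounding_box()
    print(box)
    print(normalize(box, width=100, height=50))
```

Reducing a scribble to its bounding box discards the shape of the stroke, so this is a simplification for illustration only. A system that cares about free-form shapes would need to keep the full point list or a mask.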

Copy-paste prompts

Prompt 1

How do I set up the Ferret model locally to run the interactive demo, and what GPU and server processes are required?

Prompt 2

Walk me through how to use a freehand scribble as input to the Ferret model and get a spatially grounded response.

Prompt 3

How does Ferret-UI differ from the base Ferret model, and what kinds of UI questions can it answer?

Prompt 4

What is the GRIT dataset used for in Ferret, and how was it created?

Prompt 5

How do I evaluate a vision model on Ferret-Bench to measure spatial grounding performance?


Verify against the repo before relying on details.