explaingit

facebookresearch/sam2

19,182 Jupyter Notebook Audience · developer Complexity · 3/5 Maintained License Setup · hard

TLDR

AI model that outlines the objects you point to in photos and videos, then tracks them frame-by-frame through the clip.

Mindmap

mindmap
  root((SAM 2))
    What it does
      Segment objects in images
      Track objects in videos
      Memory across frames
    How it works
      Transformer architecture
      Streaming memory system
      Multiple size variants
    Use cases
      Video editing and effects
      Medical image analysis
      Training data labeling
      Object detection apps
    Tech stack
      Python PyTorch
      GPU acceleration
      Jupyter notebooks
    Getting started
      Click to select objects
      Works on photos and video
      Requires GPU hardware

Things people build with this

USE CASE 1

Cut out subjects from video clips for editing by automatically tracing object boundaries frame-by-frame.

USE CASE 2

Label objects in photos and videos to create training datasets for other AI models.

USE CASE 3

Analyze medical scans by automatically segmenting organs or tumors to assist diagnosis.

USE CASE 4

Build apps that understand where objects are in images by detecting and outlining them automatically.
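For the labeling and app-building use cases, a segmentation mask usually has to be turned into a compact annotation before other code can use it. The following is a minimal sketch in plain Python, assuming the mask is a 2D grid of 0/1 values. The function name `mask_to_annotation` and the record fields are hypothetical and are not a SAM 2 output format.

```python
def mask_to_annotation(mask, label):
    """Summarize a binary mask (rows of 0/1) as a bounding box and pixel area.

    Returns None when the mask is empty.
    The box is [x_min, y_min, width, height] in pixel units.
    """
    # Collect the column and row index of every foreground pixel.
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) for value in row if value]
    if not xs:
        return None
    x_min, y_min = min(xs), min(ys)
    return {
        "label": label,
        "bbox": [x_min, y_min, max(xs) - x_min + 1, max(ys) - y_min + 1],
        "area": len(xs),
    }
```

A real pipeline would more likely work on NumPy arrays and write an established annotation format; the plain-list version only shows the idea.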

Tech stack

Python, PyTorch, Transformer, CUDA, Jupyter Notebook

Getting it running

Difficulty · hard Time to first run · 1h+

Requires a CUDA-capable GPU and a matching PyTorch install. Model checkpoints must be downloaded separately, and fast video inference needs extra optimization such as compiling the model.
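Before installing anything, it can help to confirm the machine meets the basic requirements. The following is a minimal sketch, assuming the version floors stated on this page (Python 3.10 and PyTorch 2.5.1). The names `parse_version` and `check_environment` are hypothetical helpers, not part of SAM 2.

```python
import importlib.util
import re
import sys


def parse_version(text):
    """Turn a version string such as '2.5.1+cu121' into (2, 5, 1)."""
    core = text.split("+")[0]  # drop a local build suffix like '+cu121'
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def check_environment(min_python=(3, 10), min_torch=(2, 5, 1)):
    """Report whether Python, PyTorch, and CUDA look usable."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_ok": False,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported only when present, so the check never crashes

        report["torch_ok"] = parse_version(str(torch.__version__)) >= min_torch
        report["cuda_available"] = torch.cuda.is_available()
    return report
```

Calling `check_environment()` returns a dictionary of booleans. Any `False` entry points at what to fix before attempting the install.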

Licensed under Apache 2.0: the code and model checkpoints can be used freely for research and commercial purposes.

In plain English

SAM 2 (Segment Anything Model 2) is an AI model from Meta's research lab that can identify and outline any object in a photo or video, a task called "segmentation." You point it at an object (by clicking a point on it or drawing a box around it), and it traces the boundary of that object. The key upgrade over the original SAM is that it works on video too, tracking the object frame-by-frame across the entire clip, even as the object moves or is partially hidden.

Under the hood, it uses a transformer architecture, the same family of neural networks behind modern language models, plus a "streaming memory" system that lets it remember where an object was in previous frames to keep tracking it in later ones. Meta also released a large new video segmentation dataset (SA-V) that was used to train the model. Multiple size variants are available (tiny, small, base plus, large), and the model can be compiled for faster video processing.

You'd use this when you need to isolate objects in photos or videos: cutting out subjects for video editing, labeling objects to train other AI models, analyzing medical scans, or building apps that need to "understand" where things are in an image. It requires Python 3.10 or higher, PyTorch 2.5.1 or higher, and a GPU. Usage examples are provided as Jupyter notebooks.
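The click-to-segment workflow described above can be sketched in a few lines. This is a hedged illustration, not the project's reference code. The imports and calls (`build_sam2`, `SAM2ImagePredictor`, `set_image`, `predict`) follow the repository's README and image predictor notebook as of this writing, so verify them against the repo. The helper names `build_point_prompt` and `segment_with_clicks` are hypothetical, and the checkpoint and config paths are left as parameters because they depend on which variant you download.

```python
def build_point_prompt(clicks):
    """Split (x, y, is_foreground) clicks into coordinate and label lists.

    Label 1 marks a point on the object, label 0 a point to exclude.
    """
    coords = [[float(x), float(y)] for x, y, _ in clicks]
    labels = [1 if is_foreground else 0 for _, _, is_foreground in clicks]
    return coords, labels


def segment_with_clicks(image, clicks, checkpoint, model_cfg, device="cuda"):
    """Return the highest-scoring mask for the clicked object.

    `image` is an RGB array of shape (H, W, 3). `checkpoint` and `model_cfg`
    are paths to a downloaded SAM 2 checkpoint and its matching config file.
    """
    # Imported lazily so this file loads even without the heavy dependencies.
    import numpy as np
    import torch
    from sam2.build_sam import build_sam2
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    coords, labels = build_point_prompt(clicks)
    predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint, device=device))

    with torch.inference_mode():
        predictor.set_image(image)
        # multimask_output=True asks for several candidate masks with scores.
        masks, scores, _ = predictor.predict(
            point_coords=np.array(coords, dtype=np.float32),
            point_labels=np.array(labels, dtype=np.int32),
            multimask_output=True,
        )
    return masks[int(np.argmax(scores))]
```

A single foreground click would be passed as `clicks=[(x, y, True)]`. Adding a background click such as `(x2, y2, False)` is a common way to exclude a region the first mask wrongly included.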

Copy-paste prompts

Prompt 1
Show me how to use SAM 2 to segment an object in a single image by clicking on it.
Prompt 2
How do I track an object across an entire video using SAM 2's streaming memory?
Prompt 3
What's the difference between the tiny, small, base plus, and large SAM 2 model variants, and which should I use?
Prompt 4
How can I use SAM 2 to automatically label objects in a dataset of images for training another model?
Prompt 5
Show me how to compile SAM 2 for faster video processing on my GPU.

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.