explaingit

avbiswas/sam2-mlx

Analysis updated 2026-06-24

Stars · 27 | Python | Audience · researcher | Complexity · 4/5 | Setup · moderate

TLDR

Apple Silicon MLX port of Meta SAM 2.1 that segments and tracks objects in videos locally on a Mac, with an API matching the official SAM2 codebase.

Mindmap

mindmap
  root((sam2-mlx))
    Inputs
      Video frames
      Click prompts
      Box prompts
    Outputs
      Object masks
      Stacked tensors
      Streamed events
    Use Cases
      Track objects in video
      Cut out subjects from clips
      Run SAM2 locally on Apple Silicon
    Tech Stack
      MLX
      Python
      safetensors
      Hugging Face

What do people build with it?

USE CASE 1

Segment and track objects through a video on a Mac without a CUDA GPU

USE CASE 2

Convert Hugging Face SAM 2.1 checkpoints to MLX safetensors for local inference

USE CASE 3

Build a per-frame streaming UI on top of SAM2 with click corrections

What is it built with?

Python, MLX, safetensors

How does it compare?

| | avbiswas/sam2-mlx | mobiusquant/openmobius-skill | alicankiraz1/gemma-4-31b-mtp-vllm-server |
| --- | --- | --- | --- |
| Stars | 27 | 27 | 26 |
| Language | Python | Python | Python |
| Setup difficulty | moderate | moderate | hard |
| Complexity | 4/5 | 3/5 | 4/5 |
| Audience | researcher | data | ops devops |

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

Needs Apple Silicon with enough unified memory and Python 3.14, plus a weight conversion step from a Hugging Face SAM2.1 checkpoint.
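
The requirements above can be checked before installing anything. The following is a minimal sketch and not part of the repo: `check_environment` is a hypothetical helper, and it treats Python 3.14 as a minimum version, which is an assumption (the source only says "Python 3.14"). It does not check unified memory.

```python
import platform
import sys


def check_environment(min_python=(3, 14)):
    """Return a list of human-readable problems; an empty list means the basics look fine."""
    problems = []
    # MLX targets macOS on Apple Silicon, so check the OS and CPU architecture.
    if sys.platform != "darwin":
        problems.append(f"expected macOS, found {sys.platform}")
    if platform.machine() != "arm64":
        problems.append(
            f"expected an arm64 (Apple Silicon) CPU, found {platform.machine()}"
        )
    # Assumption: the stated Python version is a lower bound, not an exact pin.
    if sys.version_info[:2] < min_python:
        found = ".".join(map(str, sys.version_info[:2]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"expected Python {wanted}+, found {found}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("OK" if not issues else "\n".join(issues))
```

A Python interpreter running under Rosetta reports `x86_64` here even on an Apple Silicon Mac, so a failure on the CPU check can also mean the wrong interpreter build is in use.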

In plain English

mlx-sam is an Apple Silicon port of Meta's SAM 2.1, a model that cuts objects out of images and tracks them through video. The point of the project is to do this work locally on a Mac using Apple's MLX framework, with Python 3.14 and no PyTorch in the runtime path. PyTorch is only used as an optional extra for converting weights and for comparing results against the official model.

The basic workflow matches the upstream SAM2 one. You load a video, click somewhere on an object in a frame, and the model produces a mask for that object and follows it through the rest of the clip. Clicks can be positive (this is part of the object) or negative (this is not). Corrections can be added on any frame and for multiple objects. Propagation can run forward from frame zero, backward from any frame, or in both directions from a middle frame to build bidirectional results around an edit point. Box prompts also work.

The Python API mirrors the names from the official SAM2 codebase, so methods like from_pretrained, init_state, add_new_points_or_box, propagate_in_video, and reset_state behave the way an existing SAM2 user would expect. There is also a stream_in_video helper that yields per-frame events for UI or worker use, and can emit one final stacked mask tensor at the end.

Performance has a few knobs. The default image size is 1024 to match SAM2, and lower values trade mask quality for speed and memory. Memory tensors can be kept in float16. An opt-in precompute_image_features mode caches image features once during init_state, which makes repeated propagation and correction passes faster at the cost of upfront work. A separate benchmark script also offers a preview temporal downsampling mode that only runs the model on every k-th frame and interpolates the rest.

The repo ships extras around the core library. There is a local browser demo at port 7861 launched with mlx-sam-app, an mlx-sam-convert command that turns Hugging Face SAM 2.1 checkpoints (tiny, small, base-plus, large) into MLX safetensors, and a feature regression script that compares MLX outputs to PyTorch fixtures. The README lists low-level numerical differences around 1e-5, and the model catalog section reports benchmarks on an M2 Max with 32 GB of unified memory.
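
The temporal downsampling idea mentioned above (run the model on every k-th frame, interpolate the rest) can be illustrated in plain Python. This is a sketch of the general technique, not the benchmark script's code: the function names are hypothetical, masks are represented as flat lists of probabilities, and linear blending between the two nearest keyframes is an assumed interpolation scheme that the repo may not use.

```python
from bisect import bisect_right


def keyframe_indices(num_frames, k):
    """Frames the model would actually run on: every k-th frame, plus the last one."""
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    keys = list(range(0, num_frames, k))
    if keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)  # anchor the end so every frame has a right neighbour
    return keys


def interpolate_masks(key_masks, num_frames):
    """Fill skipped frames by linearly blending the two nearest keyframe masks.

    key_masks maps frame index -> soft mask (flat list of probabilities in [0, 1]).
    Returns one soft mask per frame, in frame order.
    """
    keys = sorted(key_masks)
    out = []
    for t in range(num_frames):
        pos = bisect_right(keys, t)
        left = keys[max(pos - 1, 0)]
        right = keys[min(pos, len(keys) - 1)]
        if left == right or t == left:
            # On a keyframe, or outside the keyframe range: copy the nearest mask.
            out.append(list(key_masks[left]))
            continue
        w = (t - left) / (right - left)  # 0 at the left keyframe, 1 at the right
        out.append(
            [(1 - w) * a + w * b for a, b in zip(key_masks[left], key_masks[right])]
        )
    return out


def binarize(masks, threshold=0.5):
    """Turn soft masks into boolean masks."""
    return [[p >= threshold for p in mask] for mask in masks]
```

With `k = 3` on a 7-frame clip, `keyframe_indices(7, 3)` selects frames 0, 3, and 6, so the model would run on three frames instead of seven. The cost is that fast motion between keyframes is only approximated.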

Copy-paste prompts

Prompt 1
Use sam2-mlx to load a clip, add a positive click on frame 30, and propagate the mask both directions
Prompt 2
Run mlx-sam-convert on the SAM 2.1 large Hugging Face checkpoint and load it via from_pretrained
Prompt 3
Enable precompute_image_features and float16 memory in sam2-mlx for repeated correction passes on the same video
Prompt 4
Use stream_in_video to yield per-frame masks into a websocket and emit the final stacked tensor at the end
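
As a rough illustration of what Prompt 1 asks for, the sketch below seeds one object with a positive click and propagates in both directions from that frame. It is written against a duck-typed `predictor` so it does not guess the package's import path. It assumes the port keeps the upstream SAM2 keyword arguments (`inference_state`, `frame_idx`, `obj_id`, `points`, `labels`, `start_frame_idx`, `reverse`) as well as the method names; `track_from_click` itself is a hypothetical helper, and the real predictor may expect array types rather than plain lists.

```python
def track_from_click(predictor, video_path, frame_idx, point, obj_id=1):
    """Seed one object with a positive click, then propagate forward and backward.

    `predictor` is any object exposing the upstream SAM2 video-predictor methods.
    Returns {frame_idx: mask_logits} for the frames visited.
    """
    state = predictor.init_state(video_path=video_path)

    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=frame_idx,
        obj_id=obj_id,
        points=[[float(point[0]), float(point[1])]],  # one (x, y) click in pixels
        labels=[1],  # upstream convention: 1 = positive click, 0 = negative click
    )

    masks = {}
    # First pass runs forward from the clicked frame, second pass runs backward.
    for reverse in (False, True):
        for out_idx, _out_obj_ids, out_logits in predictor.propagate_in_video(
            state, start_frame_idx=frame_idx, reverse=reverse
        ):
            masks[out_idx] = out_logits
    return masks
```

Calling `track_from_click(predictor, "clip.mp4", frame_idx=30, point=(210, 350))` would mirror the prompt, with the path and click coordinates as placeholder values. In upstream SAM2 the returned logits are turned into binary masks by thresholding at 0.0; whether this port follows the same convention should be checked against its README.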

Frequently asked questions

What is sam2-mlx?

Apple Silicon MLX port of Meta SAM 2.1 that segments and tracks objects in videos locally on a Mac, with an API matching the official SAM2 codebase.

What language is sam2-mlx written in?

Mainly Python. The stack also includes MLX and safetensors.

How hard is sam2-mlx to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is sam2-mlx for?

Mainly researchers.

Verify against the repo before relying on details.