Segment objects in an image by typing a plain-text description of what you want to select.
Combine a scribble and a text label to precisely identify and mask a region in a photo.
Run interactive multi-round segmentation where the model remembers earlier selections in a session.
Try the model using pre-trained weights from Hugging Face without training it yourself.
Requires a Linux machine with a GPU and pre-trained model weights downloaded from Hugging Face.
SEEM, short for Segment Everything Everywhere with Multi-modal prompts, is an AI research project that identifies and outlines objects in images from a wide variety of input types. Most image segmentation tools ask you to draw a box around an object or click on it. SEEM goes further: you can describe what you want in plain text, click a point, draw a rough scribble, or even provide a picture of a reference object, and the model will find and mask the matching region. This research was published at NeurIPS 2023.
The core idea is a single model that handles many different ways of specifying what to segment, instead of training a separate model for each input type. The prompts can be combined freely. For example, you could describe an object in text and also draw a scribble to narrow down where it is. The model also supports multi-round interaction, meaning it remembers context from earlier steps in a session rather than treating every request from scratch.
Setting up and running a demo requires a Linux environment with a GPU, Python, and a few dependencies that the repository documents. The README includes a one-line command that clones the repository and launches a local demo. Pre-trained model weights are available for download from Hugging Face, so you do not have to train the model yourself to try it out.
The project is built on top of an earlier model called X-Decoder, a general-purpose visual decoder released by the same research group. SEEM adds interactive segmentation features on top of that foundation. The README also links to related projects, including Semantic-SAM, which focuses on recognizing objects at different levels of detail, and Grounding SAM, which combines object detection with segmentation.
The repository is primarily a research release. It contains demo code, configuration files, and links to model checkpoints, along with benchmark numbers comparing SEEM against other methods on standard image segmentation datasets. The training code for the underlying X-Decoder model was released separately; the README notes that SEEM training code was planned for a follow-up release.
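The setup requirements mentioned above (Linux and a GPU) can be checked before cloning anything. The following is a minimal sketch, not part of the SEEM repository: the function name `check_demo_prerequisites` is hypothetical, and it assumes the GPU is an NVIDIA card whose driver installs the `nvidia-smi` tool, which the summary does not state. It does not check Python packages or model weights.

```python
import platform
import shutil


def check_demo_prerequisites(system=None, has_nvidia_smi=None):
    """Return a list of problems; an empty list means the basic checks passed.

    Hypothetical helper, not part of SEEM. Both arguments can be passed
    explicitly to override detection (useful for testing).
    """
    if system is None:
        # platform.system() returns "Linux", "Darwin", "Windows", ...
        system = platform.system()
    if has_nvidia_smi is None:
        # Assumption: an NVIDIA GPU driver puts nvidia-smi on PATH.
        has_nvidia_smi = shutil.which("nvidia-smi") is not None

    problems = []
    if system != "Linux":
        problems.append(f"expected Linux, found {system}")
    if not has_nvidia_smi:
        problems.append("nvidia-smi not found on PATH")
    return problems
```

Calling `check_demo_prerequisites()` with no arguments inspects the current machine; any returned strings describe what is missing. The repository's own documentation remains the authority on exact dependencies.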
Verify against the repo before relying on details.