explaingit

humanmllm/swim

75Python

TLDR

SWIM is the official code release for a research paper that will appear at CVPR 2026.

Mindmap

A visual breakdown will appear here once this repo is fully enriched.

In plain English

SWIM is the official code release for a research paper that will appear at CVPR 2026. The full name is See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding. The work comes from Nankai University and Alibaba's Tongyi Lab, and the goal is to make a multimodal large language model describe a specific object in a video accurately when the user simply names that object in words. The motivation, as stated in the README, is that existing video models often hallucinate when asked about one particular object in a busy scene, partly because their attention drifts to the wrong regions of the frame. SWIM adds a training signal that watches the attention maps directly. During supervised fine-tuning, tokens that name the target entity are wrapped in special <ins>...</ins> tags, and an extra loss term encourages the model to focus its attention on the visual region that actually corresponds to those tagged tokens. Alongside the model, the authors publish a dataset called NL-Refer. It is built on top of an existing dataset called VideoRefer-700K, which marked target objects by drawing coloured masks on video frames. NL-Refer replaces those visual prompts with natural-language descriptions written by GPT-4o, with the key referring word tagged for attention supervision. The release includes about 125 thousand detailed-caption samples, 10 thousand question-answer samples, and benchmark splits for description generation and multiple-choice question answering. The shipped model is SWIM-7B, fine-tuned from Qwen2.5-VL-7B-Instruct, and it uses the same inference API as the base model. The README walks through a Python example with the Hugging Face Transformers library and the qwen_vl_utils helper. Only the language model side is updated during fine-tuning, while the vision encoder stays frozen to keep training affordable. Training itself runs on eight GPUs using DeepSpeed Zero-3, BF16 precision, and Flash Attention 2. One installation detail to watch: SWIM depends on a customised fork of Hugging Face Transformers that is shipped inside the repository, in a transformers/ folder. You must install that local copy in editable mode, not the version from PyPI, because the attention supervision logic lives in the fork. Model weights and the NL-Refer dataset are hosted on Hugging Face under the BBBBCHAN account.

Open on GitHub → Explain another repo

Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.