humanmllm/swim

Analysis updated 2026-06-24

★ 75PythonAudience · researcherComplexity · 5/5Setup · hard

Mindmap

mindmap
  root((SWIM))
    Inputs
      Video frames
      Natural language reference
      Question prompt
    Outputs
      Object caption
      Question answer
      Attention maps
    Use Cases
      Fine-grained video QA
      Object-referring captioning
      Attention-aligned training research
    Tech Stack
      Python
      PyTorch
      Transformers
      DeepSpeed
      FlashAttention
      Qwen2.5-VL

mindmap root((SWIM)) Inputs Video frames Natural language reference Question prompt Outputs Object caption Question answer Attention maps Use Cases Fine-grained video QA Object-referring captioning Attention-aligned training research Tech Stack Python PyTorch Transformers DeepSpeed FlashAttention Qwen2.5-VL

Click or tap to explore — scroll the page freely

What do people build with it?

USE CASE 1

Run video object-referring inference with SWIM-7B on your own clips

USE CASE 2

Fine-tune a vision-language model using the NL-Refer dataset and attention supervision loss

USE CASE 3

Reproduce the CVPR 2026 benchmark numbers on description and multiple-choice QA

USE CASE 4

Adapt the <ins> tagging trick to align attention with referring tokens in another MLLM

What is it built with?

PythonPyTorchTransformersDeepSpeedFlashAttention

How does it compare?

	humanmllm/swim	tencent-hunyuan/hy-mt2	krishnaik06/multiple-linear-regression
Stars	75	76	77
Language	Python	Python	Python
Last pushed	—	—	2019-01-31
Maintenance	—	—	Dormant
Setup difficulty	hard	hard	easy
Complexity	5/5	4/5	1/5
Audience	researcher	researcher	general

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard Time to first run · 1day+

Requires installing the bundled transformers fork in editable mode and 8 GPUs with DeepSpeed Zero-3 for training.

In plain English

SWIM is the official code release for a research paper that will appear at CVPR 2026. The full name is See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding. The work comes from Nankai University and Alibaba's Tongyi Lab, and the goal is to make a multimodal large language model describe a specific object in a video accurately when the user simply names that object in words. The motivation, as stated in the README, is that existing video models often hallucinate when asked about one particular object in a busy scene, partly because their attention drifts to the wrong regions of the frame. SWIM adds a training signal that watches the attention maps directly. During supervised fine-tuning, tokens that name the target entity are wrapped in special <ins>...</ins> tags, and an extra loss term encourages the model to focus its attention on the visual region that actually corresponds to those tagged tokens. Alongside the model, the authors publish a dataset called NL-Refer. It is built on top of an existing dataset called VideoRefer-700K, which marked target objects by drawing coloured masks on video frames. NL-Refer replaces those visual prompts with natural-language descriptions written by GPT-4o, with the key referring word tagged for attention supervision. The release includes about 125 thousand detailed-caption samples, 10 thousand question-answer samples, and benchmark splits for description generation and multiple-choice question answering. The shipped model is SWIM-7B, fine-tuned from Qwen2.5-VL-7B-Instruct, and it uses the same inference API as the base model. The README walks through a Python example with the Hugging Face Transformers library and the qwen_vl_utils helper. Only the language model side is updated during fine-tuning, while the vision encoder stays frozen to keep training affordable. Training itself runs on eight GPUs using DeepSpeed Zero-3, BF16 precision, and Flash Attention 2. One installation detail to watch: SWIM depends on a customised fork of Hugging Face Transformers that is shipped inside the repository, in a transformers/ folder. You must install that local copy in editable mode, not the version from PyPI, because the attention supervision logic lives in the fork. Model weights and the NL-Refer dataset are hosted on Hugging Face under the BBBBCHAN account.

Copy-paste prompts

Prompt 1

Walk me through running SWIM-7B inference on a 10-second video clip using the Transformers example and qwen_vl_utils

Prompt 2

Set up the SWIM training pipeline with DeepSpeed Zero-3 and BF16 on 8 GPUs and reproduce one epoch on NL-Refer

Prompt 3

Explain how the attention supervision loss in SWIM uses <ins> tags and where in the patched transformers fork it lives

Prompt 4

Convert the NL-Refer 125k caption split into a Hugging Face dataset for evaluation against a baseline Qwen2.5-VL

Prompt 5

Adapt SWIM's attention alignment idea to a 3B-class video LLM with limited GPU memory

Frequently asked questions

What is swim?

Official CVPR 2026 code for SWIM, a video MLLM fine-tuned from Qwen2.5-VL-7B that aligns attention maps with referring text so the model describes a named object in a busy scene without hallucinating.

What language is swim written in?

Mainly Python. The stack also includes Python, PyTorch, Transformers.

How hard is swim to set up?

Setup difficulty is rated hard, with roughly 1day+ to a first successful run.

Who is swim for?

Mainly researcher.

Open on GitHub → Explain another repo

This repo across BitVibe Labs

Verify against the repo before relying on details.