chloeqxq/macd

★ 20 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((macd))
    What it does
      Reduces video hallucinations
      Contrastive decoding
      No model retraining
    How it works
      YOLO object detection
      Counterfactual masking
      Original vs masked compare
    Tech Stack
      Python
      YOLO
      Qwen2-VL
      HuggingFace transformers
    Use Cases
      Video QA evaluation
      Hallucination research
      Model reliability testing

Things people build with this

USE CASE 1

Reduce hallucinations in video question-answering models without retraining by running MACD's contrastive decoding against your evaluation benchmark.

USE CASE 2

Test whether a video-language model like Qwen2.5-VL is guessing based on patterns or actually attending to what is visible in the video.

USE CASE 3

Run a full hallucination-suppression pipeline on yes/no or multiple-choice video questions using a shell script that handles detection, masking, inference, and scoring in sequence.

USE CASE 4

Study how masking visually relevant objects changes a model's output confidence to understand what the model is truly relying on.
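One way to quantify the effect described in use case 4 is to compare the answer probabilities the model assigns on the original clip with those it assigns on the masked clip. The sketch below is illustrative only: the function names and the toy logits are hypothetical, and it assumes you have already extracted one logit per answer option from each of the two forward passes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_shift(options, logits_original, logits_masked):
    """Per-option probability drop when the relevant objects are masked.

    A large positive value means the option lost confidence once the
    objects were hidden, which suggests it depended on visual evidence.
    """
    p_orig = softmax(logits_original)
    p_mask = softmax(logits_masked)
    return {opt: po - pm for opt, po, pm in zip(options, p_orig, p_mask)}

# Toy logits (made up for illustration, not taken from any real run).
shift = confidence_shift(
    ["A", "B", "C"],
    logits_original=[3.0, 1.0, 0.5],
    logits_masked=[1.2, 1.0, 0.9],
)
for option, delta in shift.items():
    print(f"{option}: {delta:+.3f}")
```

Because both distributions sum to one, the shifts always sum to zero: confidence lost by one option is redistributed to the others.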

Tech stack

Python · YOLO · Qwen2-VL · transformers · FFmpeg

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires a GPU, separately downloaded Qwen2-VL model weights, FFmpeg, benchmark video datasets, and a pinned Python and HuggingFace transformers version to run correctly.

No license information is mentioned in this repository.

In plain English

MACD stands for Model-Aware Contrastive Decoding, a research method aimed at reducing a specific problem in AI video understanding: hallucinations. A hallucination in this context means the AI model claims something is present in a video when it is not, or gets details wrong. This repository provides code for a technique that addresses that problem without requiring any additional training of the underlying model.

The core idea is to build a "counterfactual" version of the input video by detecting and masking the objects most relevant to the question being asked, then running the model on both the original and the masked video. By comparing what the model says about each version, the method can suppress the parts of its output that rely on guessing or pattern-matching rather than what is actually visible. YOLO, a standard object-detection tool, is used to find those relevant objects, and the masking strength is tuned automatically per clip.

In practical terms, you point the code at a video benchmark dataset and a set of questions in a specific JSON format, run a shell script, and it handles detection, mask optimization, counterfactual video synthesis, inference, and scoring in sequence. The supported question types are yes/no and multiple-choice. The README lists several video-language model checkpoints that have been tested, including variants of Qwen2-VL and Qwen2.5-VL, which are open-weight models you download separately.

This is a research release, not a production-ready library. The repository contains only the method implementation and a basic evaluation pipeline, and it is tested on a specific Python version with a pinned version of the Hugging Face transformers library. External assets such as model weights, benchmark videos, and FFmpeg must be obtained separately before the pipeline can run.
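The contrastive step can be sketched in a few lines. The example below uses a generic contrastive-decoding rule, `(1 + alpha) * log p_original - alpha * log p_masked`, which is common in the visual contrastive decoding literature; it is an assumption here and may differ from the formula MACD actually implements. The function names, the `alpha` parameter, and the toy logits are all hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_scores(logits_original, logits_masked, alpha=1.0):
    """Assumed rule: (1 + alpha) * log p_original - alpha * log p_masked.

    Options the model still favors after the relevant objects are masked
    get penalized, since that preference cannot come from those objects.
    """
    lp_orig = log_softmax(logits_original)
    lp_mask = log_softmax(logits_masked)
    return [(1 + alpha) * o - alpha * m for o, m in zip(lp_orig, lp_mask)]

def pick_answer(options, logits_original, logits_masked, alpha=1.0):
    """Return the option with the highest contrastive score."""
    scores = contrastive_scores(logits_original, logits_masked, alpha)
    best = max(range(len(options)), key=scores.__getitem__)
    return options[best]

# Toy yes/no question (logits are made up for illustration).
# The model leans "yes" on the original clip and leans "yes" even more
# strongly once the queried object is masked out.
options = ["yes", "no"]
logits_original = [2.0, 1.5]
logits_masked = [2.2, 0.5]

plain = pick_answer(options, logits_original, logits_masked, alpha=0.0)
contrasted = pick_answer(options, logits_original, logits_masked, alpha=1.0)
print(plain, contrasted)
```

With `alpha=0.0` the rule reduces to ordinary decoding on the original video. With `alpha=1.0` the "yes" preference that survives masking is subtracted out, so the toy example flips to "no".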

Copy-paste prompts

Prompt 1

I want to test MACD on a Qwen2.5-VL checkpoint with a yes/no video QA benchmark. Walk me through the required JSON question format and the shell script command to run the full pipeline.

Prompt 2

Explain how MACD's counterfactual masking works: how does YOLO decide which objects to mask, and how is the masking strength tuned automatically per video clip?

Prompt 3

I'm seeing hallucinations in my video question-answering model on multiple-choice questions. How do I apply MACD's contrastive decoding to compare original vs masked outputs and suppress the pattern-matching answers?

Prompt 4

What are the exact Python version and transformers library version requirements to run MACD, and what breaks if I use a newer version of transformers?

Prompt 5

How does MACD compute its final score by comparing model output on the original video vs the object-masked counterfactual? Walk me through the math behind the contrastive step.

Verify against the repo before relying on details.