Reduce hallucinations in video question-answering models without retraining by running MACD's contrastive decoding against your evaluation benchmark.
Test whether a video-language model like Qwen2.5-VL is guessing based on patterns or actually attending to what is visible in the video.
Run a full hallucination-suppression pipeline on yes/no or multiple-choice video questions using a shell script that handles detection, masking, inference, and scoring in sequence.
Study how masking visually relevant objects changes a model's output confidence to understand what the model is truly relying on.
Requires a GPU, separately downloaded Qwen2-VL or Qwen2.5-VL model weights, FFmpeg, benchmark video datasets, a specific Python version, and a pinned version of the Hugging Face transformers library.
MACD stands for Model-Aware Contrastive Decoding, a research method aimed at reducing a specific problem in AI video understanding: hallucinations. A hallucination in this context means the AI model claims something is present in a video when it is not, or gets details wrong. This repository provides code for a technique that addresses that problem without requiring any additional training of the underlying model.

The core idea is to build a "counterfactual" version of the input video by detecting and masking the objects most relevant to the question being asked, then running the model on both the original and the masked video at the same time. By comparing what the model says about each version, the method can suppress the parts of its output that rely on guessing or pattern-matching rather than what is actually visible. YOLO, a standard object-detection tool, is used to find those relevant objects, and the masking strength is tuned automatically per clip.

In practical terms, you point the code at a video benchmark dataset and a set of questions in a specific JSON format, run a shell script, and it handles detection, mask optimization, counterfactual video synthesis, inference, and scoring in sequence. The supported question types are yes/no and multiple-choice. The README lists several video-language model checkpoints that have been tested, including variants of Qwen2-VL and Qwen2.5-VL, which are open-weight models you download separately.

This is a research release, not a production-ready library. The repository contains only the method implementation and a basic evaluation pipeline, and it is tested on a specific Python version with a pinned version of the Hugging Face transformers library. External assets such as model weights, benchmark videos, and FFmpeg must be obtained separately before the pipeline can run.
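The comparison between the original and the masked video can be illustrated with a small sketch. It uses the generic contrastive-decoding combination found in related work (boost the original-video logits, subtract the masked-video logits, and keep only options the original pass already finds plausible). This is an assumption for illustration, not the repository's confirmed formula: the function names, the `alpha` and `beta` parameters, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_scores(orig_logits, masked_logits, alpha=1.0, beta=0.1):
    """Hypothetical helper: combine logits from the original and masked video.

    alpha controls how strongly the masked-video logits are subtracted.
    beta is a plausibility cutoff: options whose original-video probability
    falls below beta * (top probability) are excluded from the comparison.
    """
    if len(orig_logits) != len(masked_logits):
        raise ValueError("logit lists must have the same length")
    orig_probs = softmax(orig_logits)
    cutoff = beta * max(orig_probs)
    scores = []
    for lo, lm, p in zip(orig_logits, masked_logits, orig_probs):
        if p < cutoff:
            scores.append(float("-inf"))  # implausible under the original video
        else:
            scores.append((1 + alpha) * lo - alpha * lm)
    return scores


def pick_answer(options, orig_logits, masked_logits, alpha=1.0, beta=0.1):
    """Return the option with the highest contrastive score."""
    scores = contrastive_scores(orig_logits, masked_logits, alpha, beta)
    best = max(range(len(options)), key=lambda i: scores[i])
    return options[best]


# Toy yes/no question. The model leans "yes" on the original video, and leans
# "yes" even harder once the relevant object is masked out, which suggests the
# "yes" is driven by a prior rather than by what is visible.
options = ["yes", "no"]
orig = [2.0, 1.5]
masked = [2.2, 0.5]
print(pick_answer(options, orig, masked, alpha=0.0))  # plain decoding
print(pick_answer(options, orig, masked, alpha=1.0))  # contrastive decoding
```

With `alpha=0.0` the sketch reduces to ordinary decoding on the original video. Raising `alpha` penalizes answers that remain just as likely after the evidence is masked, which is the behavior the description attributes to the method.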
Verify against the repo before relying on details.