Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

📝 Paper Summary

Vision–Language Models (VLMs) · Hallucination mitigation · Mechanistic interpretability

The paper identifies specific 'PIH-heads' in Vision–Language Models that prioritize prompt text over visual evidence, showing that ablating them significantly reduces hallucinations without retraining.

Core Problem

Vision–Language Models (VLMs) often hallucinate by prioritizing incorrect information in textual prompts (e.g., mismatched object counts) over conflicting visual evidence.

Why it matters:

Real-world user inputs are often noisy or inaccurate, leading deployed VLMs to hallucinate rather than correct the user.
Prior work shows VLMs struggle to disentangle conflicting modalities, favoring text over vision, which degrades reliability in tasks like counting.
Current mitigation strategies often require extensive retraining or data, whereas this problem stems from internal routing mechanisms.

Concrete Example: When an image contains three waterlilies but the prompt asks to 'Describe the four waterlilies', the model hallucinates a fourth flower and describes it in detail, rather than correcting the count to three.

Key Novelty

Prompt-Induced Hallucination (PIH) Ablation

Identifies a small set of attention heads (PIH-heads) in the early layers of the language model component that act as conduits for copying incorrect prompt information.
Demonstrates that simply 'switching off' (mean-ablating) these heads stops the model from copying the prompt's error and forces it to look at the image, correcting the hallucination.
Shows these mechanisms generalize: heads found via object counting also fix hallucinations in color recognition tasks.

Evaluation Highlights

Ablating PIH-heads reduces prompt-induced hallucinations in counting tasks by up to 54%, restoring visually grounded responses.
In a color identification task, the same intervention reduces the prompt-color copying rate by up to 94.25 percentage points (from 98.50% to 4.25%).
LLaVA-OneVision shows a 4.35% improvement in baseline counting accuracy (on correct prompts) after ablation, indicating better general visual grounding.

Breakthrough Assessment

8/10

Strong mechanistic finding: identifies a specific, removable cause of a common VLM failure mode. The cross-task generalization (counting to color) without retraining suggests a fundamental architectural insight.

⚙️ Technical Details

Problem Definition

Setting: Controlled object-counting and color-identification under conflicting prompt–image inputs.

Inputs: An image containing N objects (or an object of color C), and a text prompt P that implies N+k objects (where k ≠ 0 is the count discrepancy) or a color other than C.

Outputs: A textual description or count of the objects in the image.
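A minimal sketch of how a conflicting counting prompt could be constructed from the definition above. The function name, the prompt template, and the dictionary keys are illustrative assumptions, not taken from the paper's code.

```python
def make_conflicting_prompt(noun_plural: str, true_count: int, k: int) -> dict:
    """Build a prompt that claims true_count + k objects (k must be non-zero).

    Hypothetical helper: the template below mirrors the 'Describe the 5 cats'
    example, but the paper's exact wording may differ.
    """
    if k == 0:
        raise ValueError("k must be non-zero to create a prompt-image conflict")
    claimed = true_count + k
    if claimed < 1:
        raise ValueError("claimed count must be at least 1")
    return {
        "prompt": f"Describe the {claimed} {noun_plural}.",
        "true_count": true_count,
        "prompt_count": claimed,
        "discrepancy": abs(k),
    }


example = make_conflicting_prompt("cats", true_count=3, k=2)
print(example["prompt"])
```

Keeping the true count and the claimed count next to the prompt makes it straightforward to score a response against both values later.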

Pipeline Flow

Input: Image + Misaligned Prompt (e.g., 'Describe the 5 cats' for an image with 3 cats)
VLM Processing (Standard): Early attention heads copy the prompt's constraint, and the model hallucinates 5 cats
VLM Processing (Intervention): The identified PIH-heads are mean-ablated, so the model ignores the prompt's constraint, attends to the image, and reports 3 cats

System Modules

Vision Encoder

Encodes the visual input into embeddings

Model or implementation: SigLIP (for LLaVA/Qwen2-VL) or equivalent per model

Language Model

Generates the response based on visual and textual tokens

Model or implementation: Qwen2-7B (for LLaVA-OneVision/Qwen2-VL) or DeepSeek-LLM-7B (for Janus-Pro)

Novel Architectural Elements

Inference-time modification: Targeted mean-ablation of specific 'PIH attention heads' identified via causal analysis, effectively altering the model's internal routing without retraining.
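The intervention can be illustrated with a toy sketch of mean-ablation on per-head output vectors. This is not the paper's implementation: the data layout (a dict keyed by hypothetical (layer, head) pairs), the function names, and the toy numbers are all assumptions, and a real model would apply the replacement inside the forward pass.

```python
from statistics import fmean


def mean_head_output(calibration_outputs):
    """Average a head's output vector element-wise over calibration examples."""
    dim = len(calibration_outputs[0])
    return [fmean([vec[i] for vec in calibration_outputs]) for i in range(dim)]


def mean_ablate(head_outputs, ablate, means):
    """Return per-head outputs with the selected heads replaced by their means.

    head_outputs: dict mapping (layer, head) -> output vector for one input
    ablate: set of (layer, head) keys to ablate
    means: dict mapping (layer, head) -> precomputed mean vector
    """
    return {
        key: list(means[key]) if key in ablate else list(vec)
        for key, vec in head_outputs.items()
    }


# Toy calibration data for one hypothetical head (layer 0, head 3).
calib = [[1.0, 0.0], [3.0, 2.0]]
means = {(0, 3): mean_head_output(calib)}

# Outputs for a new input: only the ablated head is overwritten.
outputs = {(0, 3): [9.0, -9.0], (0, 4): [0.5, 0.5]}
ablated = mean_ablate(outputs, ablate={(0, 3)}, means=means)
print(ablated)
```

Replacing a head's output with its mean, rather than with zeros, removes the input-specific signal while keeping downstream activations close to their usual range.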

Modeling

Base Model: LLaVA-OneVision-7B, Qwen2-VL-7B, Janus-Pro-7B

Training Method: No training involved. The paper uses pre-trained models and applies inference-time intervention (ablation).

Compute: Not reported in the paper (Inference-only experiments)

Comparison to Prior Work

vs. General Hallucination Mitigation: Focuses specifically on Prompt-Induced Hallucination (PIH) arising from user-prompt conflicts, rather than general ungroundedness.
vs. Training-based methods: Requires zero training or data; relies purely on inference-time ablation of internal mechanisms.
vs. Decoding strategies: Intervenes at the architectural level (attention heads) rather than the decoding/token selection stage.

Limitations

Analysis is limited to three specific VLMs (LLaVA-OneVision, Qwen2-VL, Janus-Pro) and may not transfer to all architectures.
The method relies on identifying heads on a calibration set (counting task), though they generalize to color.
Ablation is a blunt instrument; while effective, it removes the head's function entirely rather than selectively filtering only incorrect prompt info.
The study focuses on relatively simple visual tasks (counting, color) where ground truth is objective.

Reproducibility

Code: https://github.com/michalg04/prompt-induced_hallucinations.git

📊 Experiments & Results

Evaluation Setup

Controlled generation under conflict: Prompts intentionally overstate object counts or misstate colors compared to the image.

Benchmarks:

CountBench (object counting, modified for PIH)
Visual CounterFact (color identification, modified for PIH)

Metrics:

Prompt Match Rate (Percentage of outputs matching the incorrect prompt)
True Count/Color Match Rate (Percentage of outputs matching the visual ground truth)
Exact Match Accuracy (on baseline prompts)
Statistical methodology: Pearson correlation reported for confidence vs. prompt conformity. Confidence intervals not explicitly reported.
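As a rough illustration of the two match-rate metrics, the sketch below scores already-parsed predictions against the prompt's value and the visual ground truth. The record fields and the function name are assumptions, and how a count or color is extracted from free-form model output is left out.

```python
def match_rates(records):
    """Return (prompt_match_rate, true_match_rate) as percentages.

    Each record is a dict with 'predicted', 'prompt_value', and 'true_value'.
    These field names are illustrative, not from the paper.
    """
    if not records:
        raise ValueError("records must be non-empty")
    n = len(records)
    prompt_hits = sum(r["predicted"] == r["prompt_value"] for r in records)
    true_hits = sum(r["predicted"] == r["true_value"] for r in records)
    return 100.0 * prompt_hits / n, 100.0 * true_hits / n


records = [
    {"predicted": 5, "prompt_value": 5, "true_value": 3},  # copies the prompt
    {"predicted": 3, "prompt_value": 5, "true_value": 3},  # visually grounded
    {"predicted": 3, "prompt_value": 4, "true_value": 3},  # visually grounded
    {"predicted": 6, "prompt_value": 7, "true_value": 5},  # matches neither
]
prompt_rate, true_rate = match_rates(records)
print(prompt_rate, true_rate)
```

The two rates need not sum to 100, because an output can match neither the prompt nor the image.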

Key Results

Counting (CountBench): Ablating PIH-heads sharply reduces the rate at which models copy incorrect object counts from the prompt and increases the rate at which they report the true visual count. Values are percentages; Δ is in percentage points.

Benchmark	Metric	Baseline	This Paper	Δ
CountBench	Prompt Match Rate	42.13	10.08	-32.05
CountBench	True Count Match Rate	24.03	78.43	+54.40
CountBench	Prompt Match Rate	63.24	10.74	-52.50
CountBench	True Count Match Rate	24.89	69.47	+44.58
CountBench	True Count Match Rate	15.00	70.20	+55.20

Color (Visual CounterFact): Heads identified via the counting task generalize to color tasks, reducing prompt copying without re-identification.

Benchmark	Metric	Baseline	This Paper	Δ
Visual CounterFact	Prompt Match Rate	98.50	4.25	-94.25
Visual CounterFact	Prompt Match Rate	95.50	44.75	-50.75
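The Δ column is the post-ablation value minus the baseline, in percentage points. A short check over the reported rows, with the values copied from the table and the variable names chosen for illustration:

```python
# (benchmark, metric, baseline, after ablation, reported delta)
rows = [
    ("CountBench", "Prompt Match Rate", 42.13, 10.08, -32.05),
    ("CountBench", "True Count Match Rate", 24.03, 78.43, 54.40),
    ("CountBench", "Prompt Match Rate", 63.24, 10.74, -52.50),
    ("CountBench", "True Count Match Rate", 24.89, 69.47, 44.58),
    ("CountBench", "True Count Match Rate", 15.00, 70.20, 55.20),
    ("Visual CounterFact", "Prompt Match Rate", 98.50, 4.25, -94.25),
    ("Visual CounterFact", "Prompt Match Rate", 95.50, 44.75, -50.75),
]

# Recompute each delta and compare it with the reported one (2 decimals).
recomputed = [round(after - baseline, 2) for _, _, baseline, after, _ in rows]
consistent = all(r == row[4] for r, row in zip(recomputed, rows))
print(recomputed, consistent)
```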

Experiment Figures

Line graphs showing the probability of the model outputting the Prompt Count (Blue) vs. True Count (Orange) as the number of objects increases.

Bar charts displaying the shift in attention mass from text tokens to image tokens after ablating PIH heads.

Main Takeaways

PIH behavior is highly structured: models resist small discrepancies at low counts (N<4) but capitulate to the prompt at higher counts (N>4).
A small set of specific attention heads (often in early layers, such as layer 0) mediates this copying behavior.
Ablation mechanisms differ by model: LLaVA-OneVision shifts attention mass significantly to image tokens, while Janus-Pro primarily suppresses format copying.
The identified heads are task-agnostic regarding the type of hallucination; heads found via counting errors also fix color errors.
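One way to quantify the attention shift mentioned above is the share of a query token's attention mass that lands on image tokens, compared before and after ablation. The sketch below is an assumption about how such a quantity could be computed; the weights are toy numbers, not measurements from any model.

```python
def image_attention_share(weights, is_image_token):
    """Fraction of one query's attention mass that lands on image tokens."""
    if len(weights) != len(is_image_token):
        raise ValueError("weights and mask must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    image_mass = sum(w for w, is_img in zip(weights, is_image_token) if is_img)
    return image_mass / total


# Toy sequence: two image tokens followed by two text tokens.
is_image = [True, True, False, False]
before = [0.1, 0.1, 0.5, 0.3]  # hypothetical weights without ablation
after = [0.4, 0.3, 0.2, 0.1]   # hypothetical weights with PIH-heads ablated

shift = image_attention_share(after, is_image) - image_attention_share(before, is_image)
print(round(shift, 3))
```

A positive shift means more attention mass moved onto image tokens, which is the pattern described for LLaVA-OneVision.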

📚 Prerequisite Knowledge

Prerequisites

Attention mechanisms in Transformers
Vision–Language Model (VLM) architecture
Mechanistic interpretability (specifically ablation)

Key Terms

PIH: Prompt-Induced Hallucination, an error where a model generates output consistent with a misleading prompt rather than the visual evidence

Ablation: Selectively disabling specific components (here, attention heads) of a model to study their causal effect on behavior

Mean ablation: Replacing the output of an attention head with its average activation over a dataset, effectively neutralizing its specific signal while maintaining average statistics

Discrepancy distance: The magnitude of the difference between the ground truth (e.g., 3 objects) and the prompt's claim (e.g., 5 objects)

Sycophancy: The tendency of a model to agree with or conform to the user's input/bias, even when that input is incorrect

LLaVA-OneVision: A state-of-the-art open-source Vision–Language Model family

Attention head: A sub-component of the Transformer architecture that learns to focus on different parts of the input sequence

Format copying: A behavior where the model outputs the correct answer but mimics the stylistic format (e.g., digit vs. word) of the incorrect prompt
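Format copying can be flagged with a simple heuristic: the response states the true count but in the same surface format (digit or word) as the count in the prompt. The sketch below is an illustrative assumption; the word list, regex, and function names are not from the paper, and a single match cannot distinguish copying from coincidence.

```python
import re

# Small number-word vocabulary for the toy example.
WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
COUNT_PATTERN = re.compile(r"\b(\d+|" + "|".join(WORDS) + r")\b")


def count_format(text):
    """Return (format, value) for the first count in text, or None if absent."""
    match = COUNT_PATTERN.search(text.lower())
    if match is None:
        return None
    token = match.group(1)
    if token.isdigit():
        return "digit", int(token)
    return "word", WORDS[token]


def is_format_copy(prompt, response, true_count):
    """True if the response gives the true count in the prompt's count format."""
    p = count_format(prompt)
    r = count_format(response)
    if p is None or r is None:
        return False
    return r[1] == true_count and r[0] == p[0]


prompt = "Describe the 5 cats."
print(is_format_copy(prompt, "There are 3 cats.", 3))
print(is_format_copy(prompt, "There are three cats.", 3))
```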