VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck

📝 Paper Summary

Hallucination suppression, Internal state analysis

VIB-Probe detects hallucinations by distilling internal attention dynamics into a compact latent representation using the Information Bottleneck principle, and mitigates them by suppressing specific hallucination-sensitive attention heads.

Core Problem

Vision-Language Models (VLMs) frequently hallucinate objects or relations not present in images, and existing detectors relying on surface-level output statistics fail to capture the internal mechanistic causes of these errors.

Why it matters:

Hallucinations undermine trust in VLMs for high-stakes applications requiring precise visual grounding
Current detection methods relying on logit entropy or external tools overlook the internal attention dynamics where errors originate
Existing mitigation strategies often require expensive retraining or heavy external verification, lacking efficient inference-time control

Concrete Example: In an image captioning task, a VLM might generate 'a man holding a frisbee' when no frisbee exists. The output probabilities can still be high because of language priors (men and frisbees co-occur often in training data), so output-level statistics miss the error. The internal attention heads responsible for visual grounding, however, show distinct, detectable patterns of 'informational drift' that VIB-Probe captures.

Key Novelty

VIB-Probe (Variational Information Bottleneck Probe)

Treats the collection of all attention head outputs across layers as a high-dimensional signal containing both hallucination cues and noise
Applies the Information Bottleneck principle to compress this signal into a compact latent variable that maximizes prediction of hallucination labels while discarding irrelevant syntactic noise
Uses gradients from this trained probe to identify specific 'hallucination-sensitive' attention heads and suppresses them during inference to fix errors

Architecture

The overall framework of VIB-Probe for detection and mitigation.

Evaluation Highlights

Outperforms state-of-the-art baselines on generative benchmarks like M-HalDetect by +2.84% AUROC, showing superior handling of free-form text
Achieves robust cross-distribution generalization: when trained on one dataset (POPE-Popular) and tested on others, it maintains performance (78.65 AUROC on M-HalDetect versus 46.21 for the baseline), whereas probing baselines degrade significantly
Mitigation strategy improves CHAIR metrics on COCO captioning, reducing object hallucinations more effectively than contrastive decoding methods like VCD

Breakthrough Assessment

7/10

Strong methodological contribution by applying VIB to internal states for detection. The gradient-based mitigation is clever and training-free for the base model, though it requires training the probe. Results are consistent across architectures.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of generated tokens as hallucinatory or grounded based on internal model states

Inputs: Tensor of pre-projection attention head outputs T from all layers and heads at decoding step u

Outputs: Hallucination risk score s_u (compared against a threshold to make the binary decision) and intervention scaling factors alpha for attention heads

Pipeline Flow

VLM Forward Pass (extracts attention head outputs)
VIB-Probe Encoder (compresses attention tensor to latent z)
VIB-Probe Decoder (predicts hallucination risk)
Intervention (if risk > threshold, compute gradients to find sensitive heads)
Suppression (scale down sensitive heads and regenerate)
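
The five steps above amount to a short detect-then-suppress loop. The sketch below is an illustration under stated assumptions, not the authors' code: `generate`, `probe_risk`, and `head_sensitivity` are hypothetical callables standing in for the VLM and the trained probe, and the threshold `tau`, scaling factor `alpha`, and `top_k` are illustrative values that this summary does not specify.

```python
from typing import Any, Callable, Dict, List, Tuple

Head = Tuple[int, int]       # (layer index, head index)
Scaling = Dict[Head, float]  # per-head multiplier; heads not listed keep 1.0


def detect_and_mitigate(
    generate: Callable[[Scaling], Tuple[str, Any]],
    probe_risk: Callable[[Any], float],
    head_sensitivity: Callable[[Any], Dict[Head, float]],
    tau: float = 0.5,    # assumed risk threshold
    alpha: float = 0.5,  # assumed multiplier applied to suppressed heads
    top_k: int = 2,      # assumed number of heads to suppress
) -> Tuple[str, float, List[Head]]:
    """Run one detect-then-suppress round; return (text, risk, suppressed heads)."""
    # Steps 1-3: plain forward pass, then score the attention states with the probe.
    text, states = generate({})
    risk = probe_risk(states)
    if risk <= tau:
        return text, risk, []

    # Step 4: pick the heads with the largest sensitivity scores.
    scores = head_sensitivity(states)
    targets = sorted(scores, key=scores.get, reverse=True)[:top_k]

    # Step 5: scale those heads down and regenerate.
    text, states = generate({head: alpha for head in targets})
    return text, probe_risk(states), targets
```

Passing the model and probe in as callables keeps the control flow independent of any particular VLM backbone.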

System Modules

VLM Backbone

Generate text and provide internal attention states

Model or implementation: LLaVA-v1.5-7B, LLaVA-v1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, or MiniGPT-4

VIB Encoder (Detection)

Compress high-dimensional attention states into a compact latent representation z

Model or implementation: 3-layer MLP (1024->512->256) with residual blocks

VIB Decoder / Classifier (Detection)

Predict hallucination probability from latent z

Model or implementation: Single linear layer
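
The encoder and classifier modules above follow the usual VIB layout, which can be sketched as below. This is a minimal pure-Python illustration, not the paper's implementation: a single linear map stands in for the 3-layer residual MLP, the diagonal-Gaussian posterior with a log-variance head is the standard VIB parameterization assumed here, and all function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Compute W x + b for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def encode(
    x: Vector, w_mu: Matrix, b_mu: Vector, w_logvar: Matrix, b_logvar: Vector
) -> Tuple[Vector, Vector]:
    """Map features to the mean and log-variance of a diagonal Gaussian q(z|v)."""
    return linear(w_mu, b_mu, x), linear(w_logvar, b_logvar, x)


def sample_latent(mu: Vector, logvar: Vector, rng: random.Random) -> Vector:
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def decode(z: Vector, w_out: Vector, b_out: float) -> float:
    """Single linear layer followed by a sigmoid, giving a risk in (0, 1)."""
    logit = sum(w * zi for w, zi in zip(w_out, z)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))
```

At inference time the mean `mu` is commonly used in place of a sample, which makes the risk score deterministic.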

Novel Architectural Elements

Application of VIB directly to the stack of pre-projection attention head outputs across all layers
Gradient-based attribution mechanism that backpropagates from the probe's risk score to specific attention heads to determine intervention targets
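
The attribution idea can be illustrated with a probe simple enough to differentiate by hand. The sketch below is an assumption-heavy toy, not the paper's procedure: it uses a logistic probe instead of the VIB encoder, scores each head by gradient times activation, and suppresses only heads whose score is positive. The summary does not state the exact scoring rule, so all of these choices, and the function names, are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def head_sensitivities(
    head_outputs: Dict[Head, List[float]],
    probe_weights: Dict[Head, List[float]],
    bias: float = 0.0,
) -> Tuple[float, Dict[Head, float]]:
    """Return the risk and a gradient-times-activation score for every head."""
    logit = bias + sum(
        w * x
        for head, xs in head_outputs.items()
        for w, x in zip(probe_weights[head], xs)
    )
    risk = 1.0 / (1.0 + math.exp(-logit))
    # For risk = sigmoid(w . x + b), d(risk)/dx_i = risk * (1 - risk) * w_i.
    slope = risk * (1.0 - risk)
    scores = {
        head: sum(slope * w * x for w, x in zip(probe_weights[head], xs))
        for head, xs in head_outputs.items()
    }
    return risk, scores


def scaling_factors(scores: Dict[Head, float], top_k: int, alpha: float) -> Dict[Head, float]:
    """Assign `alpha` to the top_k heads that raise the risk most, 1.0 to the rest."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    suppressed = {head for head in ranked[:top_k] if scores[head] > 0}
    return {head: (alpha if head in suppressed else 1.0) for head in scores}
```

With a nonlinear probe such as the MLP encoder described above, the same scores would come from automatic differentiation rather than a closed-form gradient.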

Modeling

Base Model: LLaVA-v1.5-7B (primary), also tested on LLaVA-v1.6, Qwen2.5-VL, MiniGPT-4

Training Method: Training a lightweight probe (VIB-Probe) on top of frozen VLM features

Objective Functions:

Purpose: Minimize prediction error for hallucination detection.
Formally: Binary Cross Entropy loss L_pred(phi, theta)

Purpose: Compress representation to remove noise (Information Bottleneck).
Formally: KL Divergence L_KL between posterior q(z|v) and prior r(z) = N(0,I)

Purpose: Combined objective.
Formally: L = L_pred + beta * L_KL
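
For a diagonal-Gaussian posterior, the KL term against N(0, I) has a closed form, so the combined objective above can be computed directly. The following is a minimal per-example sketch. The diagonal-Gaussian parameterization (`mu`, `logvar`) and the function names are assumptions, since the summary only gives the objective symbolically.

```python
import math
from typing import Sequence


def bce(prob: float, label: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one example; `eps` guards against log(0)."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def kl_to_standard_normal(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def vib_loss(
    prob: float, label: int, mu: Sequence[float], logvar: Sequence[float], beta: float
) -> float:
    """L = L_pred + beta * L_KL for a single example."""
    return bce(prob, label) + beta * kl_to_standard_normal(mu, logvar)
```

The KL term is zero exactly when the posterior equals the prior (`mu = 0`, `logvar = 0`), which is why a larger `beta` pushes the latent toward carrying less input-specific information.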

Key Hyperparameters:

bottleneck_dimension: 256
learning_rate: 2e-5
kl_beta_max: 3e-4
optimizer: AdamW
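
The name `kl_beta_max` suggests that beta is ramped up to a maximum during training. The sketch below collects the listed hyperparameters in a config and shows one possible schedule. The linear warmup shape, the `warmup_steps` parameter, and the function name are assumptions, as the summary does not describe the schedule.

```python
HPARAMS = {
    "bottleneck_dimension": 256,
    "learning_rate": 2e-5,
    "kl_beta_max": 3e-4,
    "optimizer": "AdamW",
}


def kl_beta(step: int, warmup_steps: int, beta_max: float = HPARAMS["kl_beta_max"]) -> float:
    """Linearly ramp beta from 0 to beta_max over warmup_steps, then hold it (assumed schedule)."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, max(0, step) / warmup_steps)
```

Starting with a small beta lets the probe first fit the labels before the compression term begins to dominate.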

Compute: Lightweight probe training; inference requires one forward pass, plus a backward pass through the probe only when intervention is triggered

Comparison to Prior Work

vs. RepProbing: VIB-Probe uses Variational Information Bottleneck to filter noise, whereas RepProbing uses raw hidden states
vs. OPERA/Lookback Lens: VIB-Probe uses the full tensor of pre-projection attention outputs rather than aggregated weights or final layer states
vs. VCD/PAI: VIB-Probe identifies intervention targets dynamically via gradient attribution from the probe, rather than using fixed heuristics or contrastive decoding

Limitations

Requires white-box access to internal attention heads (not applicable to API-only models)
Intervention requires a backward pass through the probe at inference time, adding computational overhead
Performance depends on the quality of the labeled hallucination data used to train the probe

Reproducibility

Code will be made publicly available. VLM backbones and datasets (POPE, AMBER, M-HalDetect) are public. Exact training duration not reported.

📊 Experiments & Results

Evaluation Setup

Hallucination detection (binary classification) and mitigation (generation quality improvement)

Benchmarks:

POPE (Discriminative Object Hallucination; Yes/No questions)
AMBER (Discriminative Attribute/Relation Hallucination)
M-HalDetect (Fine-grained Hallucination Detection in Responses)
COCO-Caption (Generative Image Captioning)

Metrics:

AUROC
AUPRC
CHAIR (CHAIR_i, CHAIR_s) for mitigation
Accuracy
F1 Score

Statistical methodology: Not explicitly reported in the paper
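
AUROC, the main detection metric listed above, equals the probability that a randomly chosen hallucinated example gets a higher risk score than a randomly chosen grounded one. A minimal sketch of that pairwise definition follows. The function name is illustrative, and label 1 is assumed to mean "hallucinated".

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise form is quadratic in the number of examples. It is convenient for checking small cases, while rank-based implementations are preferable for full benchmarks.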

Key Results

Detection performance on discriminative benchmarks (POPE, AMBER) and generative benchmarks (M-HalDetect, COCO-Caption) shows VIB-Probe consistently outperforming baselines.

Benchmark	Metric	Baseline	This Paper	Δ
POPE (Discriminative)	AUROC	86.15	87.35	+1.20
COCO-Caption	CHAIR_s (lower is better)	11.2	9.8	-1.4
POPE	Accuracy	86.13	86.89	+0.76
M-HalDetect (Cross-Distribution)	AUROC	46.21	78.65	+32.44

Experiment Figures

Cross-task generalization performance (AUROC) of VIB-Probe vs. RepProbing when trained on POPE-Popular and tested on generative tasks.

Main Takeaways

VIB-Probe consistently outperforms baselines in detection, with larger gains on generative tasks (+2.84% AUROC) compared to discriminative ones (+1.20%), suggesting better handling of complex text.
The Information Bottleneck effectively filters dataset-specific noise, allowing the probe to generalize significantly better across distributions (e.g., from POPE to M-HalDetect) than standard probing.
Intervention strategy based on gradient attribution successfully mitigates hallucinations, outperforming inference-time baselines like VCD and PAI on CHAIR metrics.
Ablation studies confirm the necessity of the KL-divergence term; removing it degrades performance to the level of a standard linear probe.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention mechanisms)
Variational Information Bottleneck (VIB) theory
Vision-Language Models (e.g., LLaVA architecture)

Key Terms

VIB (Variational Information Bottleneck): a method to learn a compressed representation that keeps only the information relevant to a target task

POPE (Polling for Object Hallucination Evaluation): a benchmark that asks Yes/No questions about objects in an image to test whether a VLM hallucinates objects

CHAIR (Captioning Hallucination Assessment with Image Relevance): a captioning metric based on objects that are mentioned in a caption but absent from the ground-truth annotations. CHAIR_i is the fraction of mentioned object instances that are hallucinated; CHAIR_s is the fraction of captions containing at least one hallucinated object. Lower is better

AUROC (Area Under the Receiver Operating Characteristic curve): a threshold-independent metric for binary classifiers, equal to the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (0.5 is chance, 1.0 is perfect)

AUPRC (Area Under the Precision-Recall Curve): a metric focusing on positive-class performance, useful for imbalanced datasets

KL divergence (Kullback-Leibler divergence): an asymmetric measure of how one probability distribution differs from a second, reference distribution

logit: The raw, unnormalized score a neural network outputs for each class or token before the final activation function (such as softmax)

attention heads: Components in Transformer models that allow the model to focus on different parts of the input sequence simultaneously
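
Of the terms above, CHAIR is the one most easily pinned down with a small computation. CHAIR_i counts hallucinated object mentions over all mentions, and CHAIR_s counts captions with at least one hallucinated object over all captions. The sketch below assumes that object mentions have already been extracted from each caption and mapped to the annotation vocabulary, which the real metric handles with a synonym list. The function name and input layout are illustrative.

```python
from typing import Iterable, Sequence, Set, Tuple


def chair(
    mentioned: Sequence[Iterable[str]],
    ground_truth: Sequence[Set[str]],
) -> Tuple[float, float]:
    """Return (CHAIR_i, CHAIR_s) given per-caption object mentions and image object sets."""
    total_mentions = hallucinated_mentions = hallucinated_captions = 0
    for objects, truth in zip(mentioned, ground_truth):
        objects = list(objects)
        wrong = [obj for obj in objects if obj not in truth]
        total_mentions += len(objects)
        hallucinated_mentions += len(wrong)
        hallucinated_captions += bool(wrong)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / len(mentioned) if mentioned else 0.0
    return chair_i, chair_s
```

For the frisbee example, a caption mentioning "man" and "frisbee" for an image annotated with only a man contributes one hallucinated mention out of two and counts as one hallucinated caption.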