Visual Hallucinations of Multi-modal Large Language Models

📝 Paper Summary

Visual Hallucination in Multi-modal LLMs | Benchmark Creation | Adversarial Testing

VHTest generates challenging visual hallucination instances by identifying confusing image pairs in existing datasets, creating text descriptions of failure modes, and generating new adversarial images via DALL-E 3.

Core Problem

Existing benchmarks for multi-modal LLM hallucinations rely on static datasets like COCO, which limits diversity and risks data contamination (since models may have trained on them).

Why it matters:

Limited diversity in existing datasets biases evaluation and tends to overestimate MLLM performance
Data contamination prevents accurate assessment of how models handle truly unseen or challenging visual scenarios
Visual hallucinations in critical applications (like autonomous systems) pose safety risks, necessitating rigorous adversarial testing

Concrete Example: An image contains three lamps. An MLLM (like GPT-4V) incorrectly states there are two lamps. This simple counting failure highlights the model's inability to ground text generation in visual facts, even for basic objects.

Key Novelty

Adversarial Generation via CLIP/DINO Discrepancy & Text-to-Image Synthesis (VHTest)

Identifies 'confusing' image pairs that have high similarity in CLIP embedding space but low similarity in DINO v2 space, indicating potential for visual misunderstanding
Uses an MLLM (GPT-4V) to generate text descriptions of why these images cause hallucinations, then uses a text-to-image model (DALL-E 3) to synthesize diverse new images based on these descriptions
Constructs a benchmark of 1,200 instances across 8 specific hallucination modes (e.g., counting, shape, OCR) to rigorously test MLLMs

Architecture

The VHTest pipeline: Discovery → Description → Generation

Evaluation Highlights

State-of-the-art MLLMs fail frequently on this benchmark: GPT-4V achieves only 0.383 accuracy overall
Open-source models struggle significantly: MiniGPT-v2 achieves only 0.075 overall accuracy
Fine-tuning LLaVA-1.5 on the generated VHTest data improves accuracy on position hallucinations by 20.0 percentage points (from 0.333 to 0.533)

Breakthrough Assessment

8/10

Proposes a novel, scalable pipeline for generating adversarial visual benchmarks that exposes severe weaknesses in top-tier models (GPT-4V) previously thought to be robust.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering (VQA) focusing on factuality

Inputs: An image I and a question Q related to visual properties

Outputs: A text response A that should be factually consistent with image I

Pipeline Flow

Step I: Find Initial VH Instances (Identify confusing pairs via CLIP/DINO)
Step II: Generate Text Descriptions (LLM describes hallucination causes)
Step III: Generate New Instances (Text-to-Image synthesis + Template Q&A)
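
The three steps can be sketched as a small orchestration function. This is only an illustration of the data flow, not the authors' code: the function name `run_vhtest_pipeline`, the callables `describe`, `synthesize`, and `build_qa`, and the choice of one description per seed pair are all assumptions, and the stand-ins below replace the real GPT-4V, DALL-E 3, and human annotation stages.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


def run_vhtest_pipeline(
    seed_pairs: List[Pair],
    describe: Callable[[Pair], str],
    synthesize: Callable[[str], str],
    build_qa: Callable[[str, str], Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Chain Step I output (seed pairs) through Steps II and III.

    All three callables are hypothetical hooks: `describe` stands in for the
    description generator, `synthesize` for the text-to-image model, and
    `build_qa` for template-based question/answer construction.
    """
    instances = []
    for pair in seed_pairs:
        description = describe(pair)                     # Step II
        image = synthesize(description)                  # Step III: image
        question, answer = build_qa(image, description)  # Step III: Q&A
        instances.append(
            {"image": image, "question": question, "answer": answer}
        )
    return instances


# Toy run with placeholder stages (no model is called here).
demo = run_vhtest_pipeline(
    seed_pairs=[("img_a", "img_b")],
    describe=lambda pair: f"scene that mixes up {pair[0]} and {pair[1]}",
    synthesize=lambda text: f"generated::{text}",
    build_qa=lambda image, text: ("How many lamps are in the image?", "three"),
)
```

Passing the stages in as callables keeps the sketch independent of any particular model API.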

System Modules

Confusing Pair Selector

Find image pairs in COCO that confuse the vision encoder

Model or implementation: CLIP (similarity > 0.9) and DINO v2 (similarity < 0.55)
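
The selection rule can be illustrated with a minimal sketch, assuming the similarities are cosine similarities over precomputed embeddings (the summary only says "similarity"). The function names and the toy two-dimensional vectors are hypothetical; only the two thresholds come from the text above.

```python
import math
from itertools import combinations

CLIP_MIN = 0.9   # from the summary: CLIP similarity > 0.9
DINO_MAX = 0.55  # from the summary: DINO v2 similarity < 0.55


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def find_confusing_pairs(clip_emb, dino_emb):
    """Return image-id pairs that look alike to CLIP but not to DINO v2.

    clip_emb and dino_emb map image id -> embedding vector; in practice
    these would come from the two encoders, here they are passed in.
    """
    pairs = []
    for a, b in combinations(sorted(clip_emb), 2):
        clip_sim = cosine(clip_emb[a], clip_emb[b])
        dino_sim = cosine(dino_emb[a], dino_emb[b])
        if clip_sim > CLIP_MIN and dino_sim < DINO_MAX:
            pairs.append((a, b))
    return pairs


# Toy embeddings: img_a and img_b are close in "CLIP" space, far in "DINO" space.
clip_emb = {"img_a": [1.0, 0.0], "img_b": [0.99, 0.1], "img_c": [0.0, 1.0]}
dino_emb = {"img_a": [1.0, 0.0], "img_b": [0.2, 1.0], "img_c": [0.0, 1.0]}
confusing = find_confusing_pairs(clip_emb, dino_emb)
```

The all-pairs loop is quadratic in the number of images, so a dataset the size of COCO would need batching or approximate nearest-neighbour search; that is left out of this sketch.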

Description Generator

Analyze successful/unsuccessful hallucination examples to describe visual properties that trigger errors

Model or implementation: GPT-4V

Image Generator (Step III: Instance Construction)

Synthesize new images based on the generated text descriptions

Model or implementation: DALL-E 3

QA Constructor (Step III: Instance Construction)

Create questions and reference answers for the new images

Model or implementation: Human annotators using templates
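
Template-based construction could look like the sketch below. The `VHInstance` record, the template wording, and the helper `build_counting_instances` are hypothetical; the summary states only that human annotators use templates, so the ground-truth count is passed in as if supplied by an annotator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VHInstance:
    """One benchmark item: image I, question Q, reference answer A."""
    image_path: str
    mode: str            # e.g. "counting"
    question_type: str   # "OEQ" (open-ended) or "YNQ" (yes/no)
    question: str
    reference_answer: str


# Hypothetical templates; the paper's actual wording is not in this summary.
TEMPLATES = {
    ("counting", "OEQ"): "How many {obj} are in the image?",
    ("counting", "YNQ"): "Are there {count} {obj} in the image?",
}


def build_counting_instances(image_path, obj, true_count, asked_count):
    """Fill both templates for one image, using an annotator-provided count."""
    oeq = VHInstance(
        image_path, "counting", "OEQ",
        TEMPLATES[("counting", "OEQ")].format(obj=obj),
        str(true_count),
    )
    ynq = VHInstance(
        image_path, "counting", "YNQ",
        TEMPLATES[("counting", "YNQ")].format(count=asked_count, obj=obj),
        "yes" if asked_count == true_count else "no",
    )
    return [oeq, ynq]


# The three-lamps example from the Core Problem section.
lamp_instances = build_counting_instances("lamps.png", "lamps", 3, 2)
```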

Novel Architectural Elements

Adversarial selection criteria using contradictory signals from two different vision encoders (CLIP vs. DINO v2) to find 'hard' examples
Automated loop using an MLLM to verbalize failure modes (text descriptions) and a generative model to scale up the dataset

Modeling

Base Model: LLaVA-1.5 for the fine-tuning experiments. Evaluated models: GPT-4V, LLaVA-1.5 (7B/13B), MiniGPT-v2, mPLUG-Owl2, InstructBLIP, Qwen-VL-Chat

Training Method: Supervised Fine-Tuning (SFT) for mitigation experiments

Training Data:

Split the constructed VHTest benchmark into 80% training / 20% testing
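
A minimal sketch of such a split, assuming a seeded random shuffle over all 1,200 instances. Whether the authors split per hallucination mode or over the whole benchmark is not stated in this summary, and `split_benchmark` is a hypothetical helper.

```python
import random


def split_benchmark(instances, train_fraction=0.8, seed=0):
    """Shuffle with a fixed seed, then cut into train and test portions."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integer ids stand in for the 1,200 benchmark instances.
train_ids, test_ids = split_benchmark(range(1200))
```

With 1,200 instances this yields 960 training and 240 test items.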

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 16
epochs: 1
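
For orientation, the reported values can be collected in a small config object and used to estimate the length of a fine-tuning run. This is a sketch: `FineTuneConfig` and `optimizer_steps` are hypothetical names, and it assumes the batch size is the global batch size with no gradient accumulation and that the training set is 80% of the 1,200 instances.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as listed in the summary."""
    learning_rate: float = 2e-5
    batch_size: int = 16
    epochs: int = 1


def optimizer_steps(num_train_examples, config):
    """Number of optimizer updates, assuming one update per batch."""
    return math.ceil(num_train_examples / config.batch_size) * config.epochs


config = FineTuneConfig()
num_train = 1200 * 80 // 100  # 80% of the benchmark, in integer arithmetic
steps = optimizer_steps(num_train, config)
```

Under these assumptions the run is short: 960 examples at batch size 16 for one epoch gives 60 updates.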

Compute: Not reported in the paper

Comparison to Prior Work

vs. POPE: VHTest covers 8 modes (not just existence) and uses synthetic images to avoid contamination
vs. MME: VHTest focuses specifically on adversarial/hard instances rather than general capabilities
vs. MMVP: MMVP uses the collected pairs directly; VHTest uses them as seeds to generate descriptions and synthesize *new* diverse images via DALL-E 3

Limitations

Reliance on DALL-E 3 may introduce biases present in the image generation model
Manual annotation/verification is still required for the final QA pairs (human-in-the-loop)
Focuses on static images; does not address video or temporal hallucinations

Reproducibility

Code: https://github.com/wenhuang2000/VHTest

The benchmark dataset (images and questions) and code are publicly available at the repository above. Prompt templates for generation are in the paper's Appendix. Training compute is not reported.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on the generated VHTest benchmark (1,200 instances)

Benchmarks:

VHTest (OEQ) (Open-Ended Question VQA) [New]
VHTest (YNQ) (Binary Yes/No VQA) [New]

Metrics:

Accuracy (fraction of responses matching reference answer)
Parsing rate (for OEQ)
Statistical methodology: Not explicitly reported in the paper
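
Accuracy on the yes/no questions could be computed as sketched below. The parsing rule (read a leading "yes" or "no", count anything else as wrong) is an assumption, as are the function names; the summary does not define the parsing rate or how open-ended answers are matched, so neither is covered here.

```python
import re


def parse_yes_no(response):
    """Return 'yes' or 'no' if the response starts with one, else None."""
    match = re.match(r"\s*(yes|no)\b", response, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def ynq_accuracy(responses, references):
    """Fraction of responses whose parsed answer equals the reference."""
    if len(responses) != len(references):
        raise ValueError("responses and references must have equal length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        parse_yes_no(resp) == ref.lower()
        for resp, ref in zip(responses, references)
    )
    return correct / len(references)


# Toy example: one correct answer out of three.
responses = ["Yes, there are two lamps.", "No.", "I think so"]
references = ["no", "no", "yes"]
score = ynq_accuracy(responses, references)
```

Responses that cannot be parsed count as incorrect in this sketch; a stricter or more lenient rule would change the reported numbers.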

Key Results

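Overall performance of state-of-the-art MLLMs on the VHTest benchmark shows significant hallucination rates. Both rows use GPT-4V's overall accuracy (0.383) as the baseline; the second row compares it with MiniGPT-v2 (0.075).
Benchmark	Metric	Baseline	This Paper	Δ
VHTest	Overall Accuracy	0.383	0.383	0.000
VHTest	Overall Accuracy	0.383	0.075	-0.308

Breakdown by specific hallucination modes reveals different model weaknesses.
Benchmark	Metric	Baseline	This Paper	Δ
VHTest	Accuracy (Orientation Mode)	0.383	0.153	-0.230
VHTest	Accuracy (OCR Mode)	0.229	0.127	-0.102

Fine-tuning experiments demonstrate that the VHTest dataset can be used to mitigate hallucinations. Baseline is LLaVA-1.5 before fine-tuning; This Paper is LLaVA-1.5 after fine-tuning on VHTest.
Benchmark	Metric	Baseline	This Paper	Δ
VHTest (Test Split)	Accuracy (Position Mode)	0.333	0.533	+0.200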

Experiment Figures

A motivating example of visual hallucination.

Examples of generated VH instances across different modes.

Main Takeaways

Even the strongest proprietary model evaluated (GPT-4V) suffers from severe visual hallucinations on adversarial examples, with overall accuracy below 40% (0.383).
Models exhibit different vulnerability profiles; GPT-4V struggles with Orientation, while open-source models like LLaVA struggle heavily with OCR and Counting.
Fine-tuning on the adversarial dataset (VHTest) effectively reduces hallucination rates without degrading performance on general VQA benchmarks (like MME).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multi-modal Large Language Models (MLLMs)
Familiarity with CLIP and DINO v2 vision encoders
Basic knowledge of text-to-image generation (e.g., DALL-E 3)

Key Terms

VH: Visual Hallucination—when an MLLM generates text details about an image that are factually incorrect

MLLM: Multi-modal Large Language Model, an AI system that takes images and text as input and generates a text response (e.g., GPT-4V)

CLIP: Contrastive Language-Image Pre-training—a model that learns to map images and text to a shared embedding space

DINO v2: A self-supervised vision transformer model known for learning robust visual features without text supervision

VHTest: The proposed tool/framework for generating diverse visual hallucination instances

OCR: Optical Character Recognition—the conversion of images of typed, handwritten, or printed text into machine-encoded text

OEQ: Open-Ended Question—questions requiring free-form text answers

YNQ: Yes/No Question—questions constrained to a binary yes or no answer