Evaluating Object Hallucination in Large Vision-Language Models

📝 Paper Summary

Vision-Language Model Evaluation, Object Hallucination

Large Vision-Language Models suffer significantly from object hallucination, which is exacerbated by instruction tuning and best evaluated using a polling-based binary classification approach (POPE) rather than caption generation.

Core Problem

Large Vision-Language Models (LVLMs) frequently generate descriptions containing objects not present in the image (hallucination), and existing evaluation metrics like CHAIR are unstable and reliant on complex parsing.

Why it matters:

Hallucination degrades user trust and creates safety risks in real-world applications like autonomous driving (e.g., hallucinating a nonexistent obstacle).
Current evaluation methods (CHAIR) are sensitive to instruction phrasing and caption length, making fair comparison between models difficult.
Larger, more capable models may hallucinate more than their smaller predecessors, a counter-intuitive result that calls for investigation.

Concrete Example: When asked to describe an image of a table with food, an LVLM might hallucinate a 'pear', 'knife', or 'bottle' because these objects frequently co-occur with dining tables in the training data, even if they are visually absent.

Key Novelty

Polling-based Object Probing Evaluation (POPE)

Shifts evaluation from open-ended caption generation to a binary classification task by asking simple 'Is there a [object] in the image?' questions.
Uses three negative sampling strategies (Random, Popular, Adversarial) to probe different types of hallucination tendencies.
Decouples evaluation from caption length and instruction wording, providing a more stable metric.

Architecture

The pipeline of the POPE evaluation method.

Evaluation Highlights

Current LVLMs show severe hallucination: LLaVA scores 50-54% accuracy on POPE Adversarial/Popular settings (near random guess), showing high overconfidence (99% 'Yes' rate).
InstructBLIP significantly outperforms other LVLMs (89.29 F1 and 88.73% accuracy on Random POPE vs ~50-68 for others), likely due to diverse instruction data.
Object hallucination is highly correlated with object frequency in instruction data: top-10 frequent objects account for ~50% of hallucinations.

Breakthrough Assessment

8/10

Systematically exposes the severity of hallucination in LVLMs and proposes a standard evaluation protocol (POPE) that has since become a key benchmark in the field.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of object existence in images given visual and text inputs.

Inputs: Image x, Question q ('Is there a <object> in the image?')

Outputs: Answer a ('Yes' or 'No')
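The input/output contract above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names `build_question` and `parse_answer` are hypothetical, and the keyword rule used to map a free-form reply to 'yes' or 'no' is an assumed heuristic.

```python
import re

TEMPLATE = "Is there a {obj} in the image?"  # question template from the problem definition


def build_question(obj: str) -> str:
    """Fill the yes/no template with an object name."""
    return TEMPLATE.format(obj=obj)


def parse_answer(reply: str) -> str:
    """Map a free-form model reply to 'yes' or 'no'.

    Assumed heuristic (not taken from the paper): any standalone
    'no' or 'not' counts as a negative answer, everything else as 'yes'.
    """
    words = re.findall(r"[a-z]+", reply.lower())
    if "no" in words or "not" in words:
        return "no"
    return "yes"


print(build_question("dog"))
print(parse_answer("No, there is no dog in the image."))
```

Because LVLMs often answer in full sentences rather than a single word, some normalization step of this kind is needed before the binary metrics can be computed.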

Pipeline Flow

Object Extraction (Ground Truth via Annotations or SEEM)
Negative Sampling (Random / Popular / Adversarial)
Prompt Construction (Yes/No Question Templates)
Model Polling (Inference)
Metric Calculation (Precision/Recall/F1)

System Modules

Object Extraction (Data Preparation)

Identify ground-truth objects present in the image.

Model or implementation: Human Annotations (COCO) or SEEM (Automatic Segmentation)

Negative Sampler (Data Preparation)

Select nonexistent objects for 'No' questions based on specific strategies to test robustness.
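As a rough sketch of the Random and Popular strategies, assuming per-image ground-truth objects and dataset-wide object counts are already available: Random draws absent objects uniformly, while Popular takes the most frequent dataset objects that are absent from the image. The function names and the toy counts are illustrative assumptions.

```python
import random
from collections import Counter


def sample_random(present, vocabulary, k, rng):
    """Random: draw k objects uniformly from those absent in the image."""
    absent = sorted(set(vocabulary) - set(present))
    return rng.sample(absent, k)  # raises ValueError if fewer than k are absent


def sample_popular(present, object_counts, k):
    """Popular: the k most frequent dataset objects that are absent in the image."""
    present = set(present)
    ranked = [obj for obj, _ in object_counts.most_common()]
    return [obj for obj in ranked if obj not in present][:k]


# Toy dataset-level object frequencies (made up for illustration).
counts = Counter({"person": 10, "chair": 7, "car": 5, "dog": 2})
print(sample_popular(["person"], counts, 2))
print(sample_random(["person"], list(counts), 2, random.Random(0)))
```

Pairing each image's k negative objects with k ground-truth objects keeps the 'Yes' and 'No' questions balanced, which is what makes a 50% Yes Ratio the reference point.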

LVLM Inference

Answer the probing questions.

Model or implementation: Target LVLM (e.g., LLaVA, mPLUG-Owl)

Novel Architectural Elements

Adversarial sampling strategy based on object co-occurrence matrices to target statistical biases in LVLMs.
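One way to realize co-occurrence-based adversarial sampling is sketched below. The helper names are hypothetical, and summing co-occurrence counts over all objects present in the image is an assumption made here for illustration; other aggregation rules are possible.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(image_objects):
    """Count how often each ordered pair of objects appears in the same image."""
    cooc = Counter()
    for objs in image_objects:
        for a, b in permutations(set(objs), 2):
            cooc[(a, b)] += 1
    return cooc


def sample_adversarial(present, cooc, vocabulary, k):
    """Pick the k absent objects that co-occur most with the present ones."""
    present = set(present)
    # Assumption: score a candidate by summing its co-occurrence with every present object.
    scores = {
        cand: sum(cooc[(p, cand)] for p in present)
        for cand in vocabulary
        if cand not in present
    }
    # Highest score first; ties broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda o: (-scores[o], o))[:k]


images = [["table", "fork", "knife"], ["table", "knife"],
          ["table", "bottle"], ["dog", "frisbee"]]
cooc = build_cooccurrence(images)
vocab = sorted({o for img in images for o in img})
print(sample_adversarial(["table", "fork"], cooc, vocab, 2))
```

In this toy data, an image containing a table and a fork is probed with 'knife' and 'bottle', mirroring the dining-table example: the negatives are exactly the objects a model relying on co-occurrence statistics is most likely to affirm.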

Modeling

Evaluated Models: mPLUG-Owl, LLaVA, MultiModal-GPT, MiniGPT-4, InstructBLIP

📊 Experiments & Results

Evaluation Setup

Probing LVLMs on object existence using MSCOCO, A-OKVQA, and GQA datasets.

Benchmarks:

MSCOCO (Val Set) (Object Hallucination Evaluation)
A-OKVQA (Object Hallucination Evaluation, via SEEM)
GQA (Object Hallucination Evaluation, via SEEM)

Metrics:

F1 Score
Accuracy
Precision
Recall
Yes Ratio (percent of 'Yes' answers)

Statistical methodology: Not explicitly reported in the paper
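The listed metrics follow from a standard confusion matrix with 'yes' as the positive class. The sketch below assumes answers have already been normalized to the strings 'yes' and 'no'; the function name `pope_metrics` is illustrative.

```python
def pope_metrics(predictions, labels):
    """Compute accuracy, precision, recall, F1, and Yes Ratio ('yes' is positive)."""
    pairs = list(zip(predictions, labels))
    if not pairs:
        raise ValueError("need at least one prediction")
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    tn = sum(p == "no" and y == "no" for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),
    }


# A model that answers 'yes' to everything on a balanced question set.
always_yes = pope_metrics(["yes"] * 4, ["yes", "yes", "no", "no"])
print(always_yes)
```

On a balanced question set, a model that always answers 'Yes' gets 50% accuracy, 100% recall, and an F1 of about 0.67, which is why accuracy near chance combined with a very high Yes Ratio signals overconfidence rather than perception.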

Key Results

Results on MSCOCO using the POPE pipeline show that InstructBLIP is significantly more robust to hallucination than other models, while models like mPLUG-Owl and LLaVA exhibit extreme overconfidence (Yes Rate ~99%).

Benchmark	Metric	Baseline	This Paper	Δ
MSCOCO	F1 Score	68.06	89.29	+21.23
MSCOCO	F1 Score	66.98	78.45	+11.47
MSCOCO	Yes Ratio (Random)	50.00	95.37	+45.37
MSCOCO	Std Dev (Prompt Variation)	3.22	0.78	-2.44
MSCOCO	CHAIR_S	13.0	32.7	+19.7

Experiment Figures

Bar charts showing the correlation between object hallucination frequency and object appearance frequency in the training data.

Main Takeaways

Most LVLMs (except InstructBLIP) suffer from severe object hallucination, often defaulting to 'Yes' for any object query.
Hallucinations are not random; they are strongly biased toward objects that appear frequently in instruction tuning data or co-occur with present objects.
Visual instruction tuning appears to exacerbate hallucination: LVLMs hallucinate more than smaller vision-language pre-trained models (VLPMs), possibly because the synthetic instruction data used for training itself contains hallucinations.
POPE offers a more stable and scalable evaluation method than CHAIR, especially when combined with automatic segmentation tools like SEEM.
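The frequency bias described in the takeaways can be checked with a simple overlap statistic: the share of hallucinated object mentions that fall among the k most frequent objects in the instruction data. The sketch below uses made-up counts, and the function name is hypothetical.

```python
from collections import Counter


def topk_hallucination_share(hallucinated, train_counts, k=10):
    """Fraction of hallucinated mentions that involve the k most frequent training objects."""
    if not hallucinated:
        return 0.0
    top_k = {obj for obj, _ in train_counts.most_common(k)}
    hits = sum(obj in top_k for obj in hallucinated)
    return hits / len(hallucinated)


# Toy numbers for illustration only; not results from the paper.
train_counts = Counter({"person": 100, "table": 80, "chair": 60, "zebra": 2})
hallucinated = ["person", "table", "zebra", "person"]
print(topk_hallucination_share(hallucinated, train_counts, k=2))
```

A share far above what uniform sampling over the vocabulary would give indicates that hallucinations track training frequency rather than occurring at random.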

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Vision-Language Models (VLMs) and Instruction Tuning
Understanding of object hallucination in image captioning
Basic knowledge of evaluation metrics like Precision, Recall, and F1

Key Terms

LVLM: Large Vision-Language Model—a model integrating a visual encoder with a Large Language Model (LLM) for multimodal tasks.

Hallucination: Generating content (in this case, objects) that is inconsistent with or absent from the source input (image).

CHAIR: Caption Hallucination Assessment with Image Relevance—a metric calculating the proportion of hallucinated objects in generated captions.

POPE: Polling-based Object Probing Evaluation—the proposed method asking Yes/No questions to verify object existence.

Visual Instruction Tuning: Fine-tuning VLMs on pairs of images and instructions to improve adherence to human prompts.

SEEM: Segment Everything Everywhere All At Once—an automatic segmentation tool used here to annotate ground-truth objects in unannotated datasets.
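For contrast with POPE, the CHAIR metric defined above can be sketched as follows. The sketch assumes the objects mentioned in each caption have already been extracted (the parsing step that makes CHAIR fragile) and computes the two usual variants: CHAIR_I, the fraction of mentioned objects that are hallucinated, and CHAIR_S, the fraction of captions containing at least one hallucinated object.

```python
def chair_scores(mentioned_per_caption, truth_per_image):
    """Return (CHAIR_I, CHAIR_S) for parallel lists of mentioned and ground-truth objects."""
    mentioned_total = 0
    hallucinated_total = 0
    captions_with_hallucination = 0
    for mentioned, truth in zip(mentioned_per_caption, truth_per_image):
        mentioned = set(mentioned)
        hallucinated = mentioned - set(truth)
        mentioned_total += len(mentioned)
        hallucinated_total += len(hallucinated)
        captions_with_hallucination += bool(hallucinated)
    n = len(mentioned_per_caption)
    chair_i = hallucinated_total / mentioned_total if mentioned_total else 0.0
    chair_s = captions_with_hallucination / n if n else 0.0
    return chair_i, chair_s


# Toy example: the first caption hallucinates 'knife' and 'pear'.
captions = [{"table", "knife", "pear"}, {"dog"}]
truths = [{"table", "fork"}, {"dog", "frisbee"}]
print(chair_scores(captions, truths))
```

Both scores depend on how many objects a caption mentions, so longer captions or differently worded instructions shift the result, which is the instability that motivates a fixed set of yes/no questions.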