LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

📝 Paper Summary

Multimodal Large Language Models (MLLMs) Hallucination Evaluation

LongHalQA is an LLM-free benchmark for MLLMs containing 6K long-context questions that unifies hallucination discrimination and completion into a multiple-choice format to evaluate complex, real-world scenarios.

Core Problem

Existing MLLM hallucination benchmarks rely either on oversimplified discriminative questions (short yes/no queries) or on computationally expensive generative evaluations that depend on unstable LLM judges.

Why it matters:

Current benchmarks with short questions fail to capture hallucinations in sophisticated real-world scenarios involving long descriptions and multi-round conversations
Reliance on fixed object sets (e.g., COCO's 80 categories) limits variability and biases evaluation
LLM-based evaluators for generative tasks are slow, costly, and introduce randomness that affects reliability

Concrete Example: A standard benchmark might ask 'Is there a cat?' (binary). LongHalQA presents a 130-word description where the model must identify subtle inconsistencies, like 'four plates' vs 'five plates' or mixed-up spatial descriptions like 'shirts in the central part' vs 'right part'.

Key Novelty

Unified MCQ format for Long-Context Hallucination

Transforms both discrimination (spotting errors) and completion (avoiding generation errors) into Multiple-Choice Questions (MCQs), eliminating the need for external LLM evaluators
Focuses specifically on long-context data (averaging 130-189 words), including object descriptions, image descriptions, and multi-round conversations, rather than short captions
Introduces LongHallGen, an automated pipeline using GPT-4V to generate, check, and format complex hallucination data
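
As a rough illustration of the unified MCQ idea, the sketch below shuffles one correct option with three hallucinated ones and records the letter of the correct answer. The `MCQ` class, `build_mcq`, and the fixed `seed` are hypothetical names chosen for this example, not the benchmark's actual data schema or code.

```python
import random
from dataclasses import dataclass
from typing import List

LETTERS = "ABCD"


@dataclass
class MCQ:
    context: str        # long text shown alongside the image
    options: List[str]  # four candidate answers, already shuffled
    answer: str         # letter of the correct option, e.g. "B"


def build_mcq(context: str, correct: str, distractors: List[str], seed: int = 0) -> MCQ:
    """Mix one correct option with three hallucinated distractors.

    Assumes the correct option differs from every distractor.
    """
    if len(distractors) != 3:
        raise ValueError("expected exactly three distractors")
    options = [correct] + list(distractors)
    random.Random(seed).shuffle(options)  # seeded so the order is reproducible
    return MCQ(context, options, LETTERS[options.index(correct)])
```

Because the score is just a comparison between the chosen letter and `answer`, no external LLM judge is needed at evaluation time.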

Architecture

Comparison between previous benchmarks (top) and LongHalQA (bottom), illustrating the data formats and task types.

Evaluation Highlights

Qwen2-VL-72B achieves the best performance on hallucination completion tasks among open-source models, surpassing LLaVA-1.6-34B
Chain-of-Thought (COT) prompting degrades performance for most MLLMs on long-context hallucination discrimination, despite helping with short queries
GPT-4o outperforms other models in hallucination discrimination, particularly for multi-round conversations (+9.5 points in MC-Accuracy, 76.0 vs. 66.5)

Breakthrough Assessment

8/10

A strong contribution that shifts the focus to long-context hallucinations and unifies evaluation in an efficient MCQ format. The automated generation pipeline is valuable, though relying on GPT-4V to generate the ground truth introduces some risk of circular dependency.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of MLLMs on detecting and avoiding hallucinations in long text aligned with images

Inputs: An image and a long text segment (either a description or conversation history)

Outputs: A multiple-choice selection (A, B, C, or D) corresponding to either the correct diagnosis of a hallucination or the correct text completion
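
The input/output contract above can be sketched as a prompt formatter plus an answer parser. This is a minimal sketch: the prompt wording and the helper names `format_prompt` and `parse_choice` are assumptions for illustration and are not taken from the paper's evaluation code.

```python
import re
from typing import List, Optional


def format_prompt(text: str, options: List[str]) -> str:
    """Lay out the long text followed by lettered options A-D."""
    lines = [text, ""]
    for letter, option in zip("ABCD", options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def parse_choice(response: str) -> Optional[str]:
    """Return the first standalone capital letter A-D, or None if absent.

    Caveat: a response that begins with the article "A" would be read
    as option A, so a stricter parser may be needed in practice.
    """
    match = re.search(r"\b([ABCD])\b", response.strip())
    return match.group(1) if match else None
```

For example, `parse_choice("The answer is (C).")` yields `"C"`, while a response with no option letter yields `None` and can be counted as incorrect.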

Pipeline Flow

Image Collection & Filtering (VisualGenome/Objects365)
Positive Data Generation (GPT-4V generates long text)
Hallucination Check (GPT-4V self-check + GroundingDINO verification)
Hallucination-Explanation Pair Generation (Inject errors/explanations)
MCQ Construction (Format as Discrimination or Completion tasks)
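
The five stages above can be read as a filter-and-transform loop. The sketch below wires them together with injected callables so that no real model is invoked; every function name is a placeholder, and dropping a whole sample when the hallucination check fails is an assumption (the actual pipeline may repair the text instead).

```python
from typing import Any, Callable, Dict, Iterable, List


def run_pipeline(
    images: Iterable[Any],
    is_complex: Callable[[Any], bool],
    generate_text: Callable[[Any], str],
    passes_check: Callable[[Any, str], bool],
    inject_hallucination: Callable[[Any, str], Any],
    to_mcq: Callable[[Any, str, Any], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Run the stages in order; each stage is supplied by the caller."""
    mcqs = []
    for image in images:
        if not is_complex(image):            # 1. image collection and filtering
            continue
        text = generate_text(image)          # 2. positive data generation
        if not passes_check(image, text):    # 3. hallucination check (assumed to drop the sample)
            continue
        pair = inject_hallucination(image, text)  # 4. hallucination-explanation pair
        mcqs.append(to_mcq(image, text, pair))    # 5. MCQ construction
    return mcqs
```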

System Modules

Image Selector (Data Construction)

Select complex images from validation sets to avoid training leakage

Model or implementation: GroundingDINO (for complexity filtering)
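
One plausible reading of complexity filtering is to keep only images with enough detected objects. The sketch below assumes the detector output is already available as a mapping from image id to detected labels; the `min_objects` threshold and the counting criterion are illustrative assumptions, not values reported in the summary.

```python
from typing import Dict, List


def select_complex_images(detections: Dict[str, List[str]], min_objects: int = 5) -> List[str]:
    """Keep image ids whose detection list has at least `min_objects` entries.

    `detections` stands in for precomputed detector output; the threshold
    is a hypothetical choice for this sketch.
    """
    return [image_id for image_id, labels in detections.items() if len(labels) >= min_objects]
```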

Generator (Data Construction)

Generate initial long descriptions and conversations

Model or implementation: GPT-4V

Verifier (Data Construction)

Detect and filter intrinsic hallucinations in generated text

Model or implementation: GPT-4V + GroundingDINO
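
A simple way to picture the detector-based half of this check is a set comparison between objects mentioned in the generated text and objects the detector found. The sketch below assumes both lists are already extracted; the function name and the exact-match rule are assumptions, and the real verifier may use a different matching strategy.

```python
from typing import List


def unsupported_objects(mentioned: List[str], detected: List[str]) -> List[str]:
    """Return mentioned objects with no case-insensitive match among detections."""
    detected_set = {label.lower() for label in detected}
    return [obj for obj in mentioned if obj.lower() not in detected_set]
```

A non-empty result flags the text as a candidate for containing an intrinsic hallucination, which a second check (here, the GPT-4V self-check) could then confirm or reject.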

Task Formatter (Data Construction)

Convert text into Hallucination Discrimination and Completion MCQs

Model or implementation: GPT-4V

Novel Architectural Elements

Unified MCQ framework for both generative (Completion) and discriminative (Discrimination) hallucination tasks
LongHallGen pipeline integrating object detection (GroundingDINO) with MLLM (GPT-4V) for automated long-context hallucination data creation

Modeling

Base Model: Evaluates 10 MLLMs including GPT-4o, Qwen2-VL, LLaVA-1.6, MiniCPM-V2

Training Method: Not applicable — this is a benchmark/evaluation paper, not a model training paper

Comparison to Prior Work

vs. POPE/MME: LongHalQA uses long-context (130+ words) and MCQs with explanations, unlike simple binary short questions
vs. Hal-Eval: LongHalQA evaluates generation via 'Hallucination Completion' MCQs, avoiding costly and unstable LLM evaluators
vs. AMBER [not cited in paper]: LongHalQA focuses on complex logic and context consistency in long text, whereas AMBER primarily targets object existence and attributes in shorter queries

Limitations

Reliance on GPT-4V for data generation and verification may introduce bias from the generator model
Evaluation is limited to static images, not video or 3D data
MCQ format, while efficient, may not perfectly capture the open-ended nature of free-form generation hallucinations

Reproducibility

Code: https://github.com/hanqiu-hq/LongHalQA

Publicly available: Dataset and evaluation code at https://github.com/hanqiu-hq/LongHalQA. Dependencies: Requires OpenAI API access for GPT-4V/GPT-4o if replicating data generation or evaluating the closed-source model.

📊 Experiments & Results

Evaluation Setup

Evaluation of 10 MLLMs on 6,485 MCQs covering discrimination and completion tasks

Benchmarks:

LongHalQA (Long-context Hallucination Discrimination & Completion) [New]

Metrics:

Accuracy (Binary)
Precision (Binary)
Yes Ratio (Binary)
MC-Accuracy (Multiple Choice)
Statistical methodology: Not explicitly reported in the paper
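
The metrics listed above can be computed as follows. This sketch assumes binary predictions and labels are the strings "yes" and "no" with "yes" treated as the positive class, and that unparseable multiple-choice responses are passed as `None`; these conventions are assumptions, not details confirmed by the summary.

```python
from typing import Dict, List, Optional


def binary_metrics(preds: List[str], labels: List[str]) -> Dict[str, float]:
    """Accuracy, precision, and yes ratio for yes/no questions."""
    n = len(labels)
    correct = sum(p == y for p, y in zip(preds, labels))
    yes = sum(p == "yes" for p in preds)
    true_pos = sum(p == "yes" and y == "yes" for p, y in zip(preds, labels))
    return {
        "accuracy": correct / n,
        "precision": true_pos / yes if yes else 0.0,  # 0.0 when the model never says yes
        "yes_ratio": yes / n,                         # share of "yes" answers, reveals answer bias
    }


def mc_accuracy(choices: List[Optional[str]], answers: List[str]) -> float:
    """Fraction of multiple-choice questions answered with the correct letter."""
    return sum(c == a for c, a in zip(choices, answers)) / len(answers)
```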

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Hallucination Discrimination (MCQ Setting): GPT-4o leads, but open-source models like LLaVA-1.6-34B and Qwen2-VL-72B show competitive performance on specific sub-tasks.
LongHalQA (Discrimination - Image Desc)	MC-Accuracy	69.1	73.4	+4.3
LongHalQA (Discrimination - Conversation)	MC-Accuracy	66.5	76.0	+9.5
Hallucination Completion (MCQ Setting): Larger models generally perform better at avoiding hallucinations when completing text, with Qwen2-VL-72B leading.
LongHalQA (Completion - Avg)	MC-Accuracy	61.0	63.9	+2.9
LongHalQA (Discrimination - Avg)	MC-Accuracy	46.61	39.42	-7.19
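
As a quick consistency check, the Δ column can be recomputed from the Baseline and This Paper columns. The numbers below are copied from the table; the row labels are shortened and the variable names are only for this example.

```python
# (row label, Baseline, This Paper) copied from the table above
rows = [
    ("Discrimination - Image Desc", 69.1, 73.4),
    ("Discrimination - Conversation", 66.5, 76.0),
    ("Completion - Avg", 61.0, 63.9),
    ("Discrimination - Avg", 46.61, 39.42),
]

# Round to two decimals to absorb floating-point noise
deltas = {name: round(new - base, 2) for name, base, new in rows}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

The recomputed values match the Δ column, including the single negative entry in the last row.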

Experiment Figures

Examples of complex hallucinations involving logic and contextual consistency.

Main Takeaways

Models struggle significantly more with long-context hallucinations (image-level descriptions/conversations) than with short object-level descriptions.
Chain-of-Thought (COT) prompting, which usually helps reduce hallucinations on short queries, degrades performance on long-context discrimination for most models (especially smaller ones).
High-resolution image support (as seen in Qwen2-VL and LLaVA-1.6) correlates with better hallucination resistance.
The unified MCQ format for 'Hallucination Completion' shows similar trends to free-form generation evaluation but is much faster and cheaper.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with hallucination in LLMs (generating content unaligned with inputs)
Knowledge of standard evaluation metrics (Accuracy, Precision)

Key Terms

MLLM: Multimodal Large Language Model—AI models capable of processing and generating both text and images

Hallucination: A phenomenon where models generate plausible textual responses that contradict the visual content of the image

MCQ: Multiple-Choice Question—a format used here to evaluate models by asking them to select the correct option from a list

COT: Chain-Of-Thought—a prompting technique encouraging models to reason step-by-step before answering

LongHallGen: The authors' proposed automated pipeline for generating long-context hallucination data using GPT-4V

GPT-4V: GPT-4 with Vision capabilities—a strong proprietary MLLM used here for data generation

Object-level Description: Text describing specific attributes, states, or relations of a single object

Image-level Description: Text covering the main content, background, and details of an entire image in a paragraph

Multi-round Conversation: Simulated dialogue between a user and an assistant about the image content