AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

📝 Paper Summary

Hallucination Evaluation, Vision-Language Benchmarks

AMBER is a multi-dimensional hallucination benchmark for Multi-modal Large Language Models that evaluates both generative and discriminative tasks across existence, attribute, and relation hallucinations without relying on GPT-4 judges.

Core Problem

Current MLLM hallucination evaluations are either costly (relying on humans/GPT-4), narrow in scope (only checking object existence), or limited to specific task types (only generative or only discriminative).

Why it matters:

Hallucinations in MLLMs (Multi-modal Large Language Models) can lead to harmful consequences if users over-rely on unfaithful content
Existing generative evaluations using GPT-4 are expensive and hard to scale for academic research
Existing discriminative evaluations (like POPE) only check object existence, missing critical attribute and relationship errors

Concrete Example: A model might correctly identify a 'dog' in an image (passing existence checks) but incorrectly describe it as 'running' when it is 'lying down' (attribute hallucination), or claim it is 'on the sofa' when it is 'on the floor' (relation hallucination). AMBER captures these errors, which existence-only checks miss.

Key Novelty

AMBER (An LLM-free Multi-dimensional Benchmark)

Unified evaluation of both generative tasks (image description) and discriminative tasks (yes/no QA) using a single comprehensive annotation set
LLM-free evaluation pipeline that uses deterministic rules and standard metrics (Precision, Recall, CHAIR) rather than opaque GPT-4 judgment, ensuring reproducibility and low cost
Fine-grained annotation covering three hallucination types: Existence (objects), Attribute (state, number, action), and Relation (spatial contact)
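
To make the three annotation dimensions concrete, here is a minimal sketch of what a per-image annotation record could look like. The class name, field names, and example values are hypothetical and chosen for illustration; they are not the schema used in the AMBER release.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ImageAnnotation:
    """Hypothetical per-image record; all field names are illustrative."""

    image_id: str
    # Existence: objects that are actually visible in the image.
    objects: Set[str] = field(default_factory=set)
    # Attribute: (object, kind, value) with kind in {"state", "number", "action"}.
    attributes: List[Tuple[str, str, str]] = field(default_factory=list)
    # Relation: pairs of objects that are in contact with each other.
    relations: List[Tuple[str, str]] = field(default_factory=list)
    # Hallucinatory targets: absent objects a model is likely to imagine.
    hallucinatory_targets: Set[str] = field(default_factory=set)


example = ImageAnnotation(
    image_id="img_0001",
    objects={"dog", "floor", "sofa"},
    attributes=[("dog", "action", "lying down"), ("dog", "number", "1")],
    relations=[("dog", "floor")],
    hallucinatory_targets={"cat", "leash"},
)
```

Keeping all four fields in one record is what lets the same annotation drive both the generative and the discriminative evaluation.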

Architecture

The AMBER evaluation pipeline, from input processing to metric calculation for both task types

Evaluation Highlights

Evaluation of 9 mainstream MLLMs (including GPT-4V) on 1,004 diverse images revealed persistent hallucinations across all models
Introduces 'AMBER Score', a composite metric combining generative hallucination rates (CHAIR) and discriminative performance (F1)
Analysis reveals trade-offs: some models excel at detecting objects (discriminative) but hallucinate frequently when describing them (generative)

Breakthrough Assessment

7/10

Provides a much-needed standardized, low-cost benchmark covering multiple hallucination dimensions. While not introducing a new model architecture, it significantly advances evaluation methodology.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of MLLM faithfulness to visual inputs across generative description and discriminative QA tasks

Inputs: Image (Img) and Instruction (Ins) constructed from prompt templates

Outputs: Textual response (R) from the MLLM (either a description or a Yes/No answer)

Pipeline Flow

Image Collection & Filtering
Fine-grained Annotation (Existence, Attribute, Relation, Hallucinatory Targets)
Prompt Template Construction (Generative & Discriminative)
Model Inference (9 MLLMs)
Response Processing & Metric Calculation

System Modules

Annotation Module

Create ground-truth annotations for evaluation

Model or implementation: Human Annotators

Prompt Generator

Create query templates based on annotations

Model or implementation: Rule-based templates
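
As a rough illustration of rule-based prompt construction, the sketch below fills string templates from annotation entries to produce yes/no questions with their expected answers. The template wording and the function name `build_discriminative_queries` are assumptions made for this example; the exact templates are given in the paper.

```python
# Hypothetical templates; the wording used by AMBER may differ.
GENERATIVE_PROMPT = "Describe this image."
EXISTENCE_TEMPLATE = "Is there a {obj} in this image?"
ATTRIBUTE_TEMPLATE = "Is the {obj} {value} in this image?"
RELATION_TEMPLATE = "Is there direct contact between the {a} and the {b} in this image?"


def build_discriminative_queries(absent, attributes, relations):
    """Return a list of (question, expected_answer) pairs.

    absent:     objects known NOT to be in the image (counterfactual queries)
    attributes: (object, value, expected_answer) triples
    relations:  (object_a, object_b, expected_answer) triples
    """
    queries = []
    for obj in absent:
        # The object is not in the image, so the correct answer is "no".
        queries.append((EXISTENCE_TEMPLATE.format(obj=obj), "no"))
    for obj, value, truth in attributes:
        queries.append((ATTRIBUTE_TEMPLATE.format(obj=obj, value=value), truth))
    for a, b, truth in relations:
        queries.append((RELATION_TEMPLATE.format(a=a, b=b), truth))
    return queries


queries = build_discriminative_queries(
    absent=["cat"],
    attributes=[("dog", "running", "no"), ("dog", "lying down", "yes")],
    relations=[("dog", "sofa", "no")],
)
```

Because every question is paired with its expected answer at construction time, scoring needs no judge model.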

Response Processor

Extract objects and answers from MLLM output

Model or implementation: NLTK / Rule-based matching
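
The sketch below shows the two jobs of the response processor: reading a yes/no answer and pulling object mentions out of a description. It deliberately swaps NLTK noun extraction for a regex tokenizer with naive plural stripping, so that the example runs without external data; the function names and the vocabulary-lookup approach are assumptions for illustration only.

```python
import re


def parse_yes_no(response):
    """Return "yes" or "no" if the response starts with one, else None."""
    words = re.findall(r"[a-z]+", response.lower())
    if words and words[0] in ("yes", "no"):
        return words[0]
    return None  # e.g. a verbose refusal that never commits to an answer


def extract_objects(description, vocabulary):
    """Return the vocabulary objects mentioned in a description.

    Simplified stand-in for noun extraction: lowercase word tokens,
    strip a trailing "s" from longer tokens, then look up known objects.
    """
    tokens = re.findall(r"[a-z]+", description.lower())
    singular = {t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens}
    return (set(tokens) | singular) & set(vocabulary)


mentioned = extract_objects(
    "A dog lies on the floor next to two sofas.",
    {"dog", "floor", "sofa", "cat"},
)
```

Returning `None` for unparseable answers keeps refusals visible instead of silently counting them as "no".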

Novel Architectural Elements

Integration of generative and discriminative evaluations into a single pipeline using the same image/annotation source
Cognitive-bias-based metric (Cog) that specifically tracks 'hallucinatory target objects': objects likely to be imagined due to context (e.g., expecting a keyboard near a monitor)

Comparison to Prior Work

vs. POPE: AMBER adds Attribute and Relation dimensions, plus Generative task evaluation
vs. GPT-4 Evaluation: AMBER is LLM-free, removing cost and reproducibility barriers
vs. MME [not cited in paper]: MME is another comprehensive benchmark; AMBER specifically focuses on hallucination types rather than general capabilities

Limitations

Dependency on NLTK for noun extraction in generative tasks may miss complex phrasings
Discriminative evaluation relies on simple Yes/No parsing, which might be confused by verbose model refusals
The set of 'hallucinatory target objects' is manually defined, which may not cover all potential model biases
Limited to English language evaluation

Reproducibility

Code: https://github.com/junyangwang0410/AMBER

Data and code are publicly available in the repository above. The benchmark uses a static set of 1,004 images and annotations, making it fully reproducible without API costs. Specific prompt templates are detailed in the paper.

📊 Experiments & Results

Evaluation Setup

Benchmarking 9 MLLMs on 1,004 images across generative and discriminative tasks

Benchmarks:

AMBER: hallucination evaluation covering generative description and visual QA [New]

Metrics:

CHAIR (Generative)
Cover (Generative - object coverage)
Hal (Generative - hallucination rate)
Cog (Generative - cognitive bias hallucination)
Accuracy, Precision, Recall, F1 (Discriminative)
AMBER Score (Composite)

Statistical methodology: Not explicitly reported in the paper
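
The generative metrics listed above can be sketched as set operations over the objects a response mentions, the annotated objects, and the hallucinatory targets. This is a minimal per-response sketch under assumed definitions: CHAIR as the share of mentioned objects that are not annotated, Cover as the share of annotated objects that are mentioned, Hal as a flag for any hallucinated object, and Cog as the share of mentioned objects that are hallucinatory targets. The function name is hypothetical, and the exact definitions should be checked against the paper.

```python
def generative_metrics(mentioned, annotated, hallucinatory_targets):
    """Compute per-response CHAIR, Cover, Hal, and Cog from object sets."""
    mentioned = set(mentioned)
    annotated = set(annotated)
    targets = set(hallucinatory_targets)
    if not mentioned or not annotated:
        raise ValueError("both object sets must be non-empty")

    correct = mentioned & annotated
    return {
        # Share of mentioned objects that are not in the image.
        "chair": 1 - len(correct) / len(mentioned),
        # Share of annotated objects that the response covers.
        "cover": len(correct) / len(annotated),
        # 1.0 if the response contains at least one hallucinated object.
        "hal": 1.0 if len(correct) < len(mentioned) else 0.0,
        # Share of mentioned objects that are known hallucinatory targets.
        "cog": len(mentioned & targets) / len(mentioned),
    }


scores = generative_metrics(
    mentioned={"dog", "sofa", "cat", "ball"},
    annotated={"dog", "sofa", "floor", "window"},
    hallucinatory_targets={"cat"},
)
```

In this example two of the four mentioned objects are wrong, so CHAIR and Cover are both 0.5, Hal is 1.0, and Cog is 0.25. Benchmark-level numbers would then be averages of these per-response values over all images.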

Main Takeaways

GPT-4V demonstrates superior performance but still suffers from hallucinations, particularly in attribute and relation details
For most models, discriminative tasks show lower hallucination rates than generative tasks
There is often a trade-off between Coverage (mentioning many objects) and Hallucination (mentioning wrong objects); models like InstructBLIP tend to say less to be safer
Attribute and Relation hallucinations are significantly more common than simple Existence hallucinations, highlighting the weakness of previous existence-only benchmarks

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multi-modal Large Language Models (MLLMs)
Familiarity with the concept of Hallucination in AI (generating unfaithful content)
Basic knowledge of evaluation metrics like Precision, Recall, and F1

Key Terms

MLLM: Multi-modal Large Language Model—an AI model capable of processing and generating both text and visual data (e.g., GPT-4V, LLaVA)

Hallucination: The generation of content that appears plausible but is factually incorrect or unfaithful to the provided image content

Generative Task: A task where the model produces open-ended text, such as 'Describe this image'

Discriminative Task: A task where the model must classify or choose between options, here specifically answering 'Yes' or 'No' to verify visual details

CHAIR: Caption Hallucination Assessment with Image Relevance—a metric measuring the percentage of objects mentioned in a caption that do not actually exist in the image

AMBER Score: A composite score introduced in this paper combining the CHAIR metric (generative) and F1 score (discriminative) to rank MLLM performance
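
A rough sketch of how such a composite could be computed follows. It assumes the score is the mean of (100 - CHAIR) and F1, both on a 0 to 100 scale, and it treats the choice of positive class for F1 as a caller-supplied parameter. The summary states only that the score combines CHAIR and F1, so the exact formula and class convention are assumptions to verify against the paper.

```python
def precision_recall_f1(predictions, labels, positive):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def amber_score(chair_pct, f1_pct):
    """Assumed composite: mean of (100 - CHAIR) and F1, both in percent."""
    return 0.5 * (100.0 - chair_pct + f1_pct)


# Assumption for this example: "no" answers form the positive class.
precision, recall, f1 = precision_recall_f1(
    predictions=["no", "no", "yes", "yes"],
    labels=["no", "yes", "no", "yes"],
    positive="no",
)
composite = amber_score(chair_pct=10.0, f1_pct=80.0)
```

Under this assumed formula a lower CHAIR and a higher F1 both raise the composite, so a model cannot rank well by doing well on only one task type.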

Existence Hallucination: Fabricating objects that are not present in the image at all

Attribute Hallucination: Correctly identifying an object but assigning it the wrong properties (e.g., wrong color, wrong action, wrong number)

Relation Hallucination: Incorrectly describing the relationship (usually spatial) between two existing objects

Counterfactual Prompting: Asking the model about something that isn't there (e.g., 'Is there a cat?' when there is none) to test if it hallucinates

LLM-free: Evaluation methods that do not require a separate Large Language Model (like GPT-4) to judge the correctness of outputs, relying instead on rule-based matching against annotations