← Back to Paper List

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, Jitao Sang
Beijing Jiaotong University, Alibaba Group, Peng Cheng Lab
arXiv (2023)
MM Factuality Benchmark

📝 Paper Summary

Hallucination Evaluation Vision-Language Benchmarks
AMBER is a multi-dimensional hallucination benchmark for Multi-modal Large Language Models that evaluates both generative and discriminative tasks across existence, attribute, and relation hallucinations without relying on GPT-4 judges.
Core Problem
Current MLLM hallucination evaluations are either costly (relying on humans/GPT-4), narrow in scope (only checking object existence), or limited to specific task types (only generative or only discriminative).
Why it matters:
  • Hallucinations in MLLMs (Multi-modal Large Language Models) can lead to harmful consequences if users over-rely on unfaithful content
  • Existing generative evaluations using GPT-4 are expensive and hard to scale for academic research
  • Existing discriminative evaluations (like POPE) only check object existence, missing critical attribute and relationship errors
Concrete Example: A model might correctly identify a 'dog' in an image (passing existence checks) but incorrectly describe it as 'running' when it is 'lying down' (attribute hallucination) or claim it is 'on the sofa' when it is 'on the floor' (relation hallucination). AMBER captures these nuances where previous object-detection-based methods failed.
Key Novelty
AMBER (An LLM-free Multi-dimensional Benchmark)
  • Unified evaluation of both generative tasks (image description) and discriminative tasks (yes/no QA) using a single comprehensive annotation set
  • LLM-free evaluation pipeline that uses deterministic rules and standard metrics (Precision, Recall, CHAIR) rather than opaque GPT-4 judgment, ensuring reproducibility and low cost
  • Fine-grained annotation covering three hallucination types: Existence (objects), Attribute (state, number, action), and Relation (spatial contact)
Architecture
Architecture Figure Figure 3
The AMBER evaluation pipeline, from input processing to metric calculation for both task types
Evaluation Highlights
  • Evaluation of 9 mainstream MLLMs (including GPT-4V) on 1,004 diverse images revealed persistent hallucinations across all models
  • Introduces 'AMBER Score', a composite metric combining generative hallucination rates (CHAIR) and discriminative performance (F1)
  • Analysis reveals trade-offs: some models excel at detecting objects (discriminative) but hallucinate frequently when describing them (generative)
Breakthrough Assessment
7/10
Provides a much-needed standardized, low-cost benchmark covering multiple hallucination dimensions. While not introducing a new model architecture, it significantly advances evaluation methodology.
×