
ALOHa: A New Measure for Hallucination in Captioning Models

Suzanne Petryk, David M. Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph E. Gonzalez, Trevor Darrell
University of California, Berkeley
arXiv (2024)
MM Factuality Benchmark

📝 Paper Summary

Hallucination Detection Image Captioning Evaluation
ALOHa is an open-vocabulary metric for image captioning that uses LLMs and semantic embeddings to detect and localize object hallucinations more accurately than fixed-list matching methods.
Core Problem
Existing hallucination metrics like CHAIR rely on exact string matching against a fixed set of MS COCO objects, failing to generalize to open-vocabulary settings or ambiguous object descriptions.
Why it matters:
  • State-of-the-art vision-language models still hallucinate objects not present in the scene, undermining reliability.
  • Current metrics cannot handle synonyms, attributes, or objects outside the fixed MS COCO object list (e.g., 'wolf' vs. 'dog').
  • Rigid string matching penalizes valid but specific descriptions or fails to detect hallucinations in novel domains.
Concrete Example: If a model captions an image with 'a purple shirt' but the reference says 'a white shirt', CHAIR may miss the attribute mismatch because 'shirt' still matches, or it may penalize the caption incorrectly. Conversely, if a model predicts 'wolf' where the reference says 'dog', CHAIR treats it as a binary error, whereas ALOHa uses semantic similarity to penalize 'wolf' less than an unrelated prediction such as 'potato'.
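To make the fixed-list limitation concrete, here is a minimal Python sketch of closed-vocabulary string matching. It is a simplification for illustration, not the CHAIR implementation: the vocabulary `FIXED_VOCAB`, the function name `fixed_list_hallucinations`, and the example captions are all hypothetical.

```python
import re

# Hypothetical closed vocabulary, standing in for a fixed object list.
# (Assumption: the real list and its synonym handling are not reproduced here.)
FIXED_VOCAB = {"dog", "shirt", "frisbee", "person"}


def fixed_list_hallucinations(caption, reference_objects):
    """Return in-vocabulary caption words that are absent from the reference objects."""
    words = re.findall(r"[a-z]+", caption.lower())
    # Only words found in the fixed vocabulary are ever considered.
    mentioned = {w for w in words if w in FIXED_VOCAB}
    return mentioned - set(reference_objects)


# 'frisbee' is in the vocabulary and not in the reference, so it is flagged.
in_vocab = fixed_list_hallucinations("A dog catches a frisbee", {"dog"})

# 'wolf' and 'ball' are outside the vocabulary, so nothing is flagged.
out_of_vocab = fixed_list_hallucinations("A wolf catches a ball", {"dog"})

print(in_vocab)      # {'frisbee'}
print(out_of_vocab)  # set()
```

Because 'wolf' and 'ball' never enter the vocabulary, the second caption is reported as hallucination-free even though the reference mentions only a dog.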
Key Novelty
Assessment with Language models for Object Hallucination (ALOHa)
  • Uses an LLM for zero-shot in-context learning to extract groundable objects and attributes from captions, replacing rigid parser-based extraction.
  • Computes a 'hallucination score' using the Hungarian matching algorithm on semantic embeddings (S-BERT) between candidate and reference objects.
  • Handles uncertainty (e.g., 'possibly a Frisbee') and complex noun phrases by filtering uncertain objects and matching based on semantic distance rather than binary presence.
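The matching step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the three-dimensional vectors are hand-made stand-ins for S-BERT embeddings, and the optimal one-to-one assignment is found by brute force over permutations rather than by the Hungarian algorithm (both yield a maximum-total-similarity matching, but brute force is only practical for a handful of objects). Scoring an unmatched candidate as 0.0 and taking the minimum per-object similarity as the caption-level score are also assumptions of this sketch. The LLM extraction and uncertainty-filtering steps are omitted.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_objects(cand, ref):
    """Maximum-similarity one-to-one matching of candidate to reference objects.

    cand, ref: dicts mapping object name -> embedding vector.
    Returns a dict: candidate name -> (matched reference name or None, similarity).
    Brute force stands in for the Hungarian algorithm (small inputs only).
    """
    cand_names = list(cand)
    # Pad with None so every candidate gets a slot.
    # Assumption: a candidate left without a reference scores 0.0.
    slots = list(ref) + [None] * max(0, len(cand) - len(ref))

    def sim(c, r):
        return 0.0 if r is None else cosine(cand[c], ref[r])

    best = max(
        permutations(slots, len(cand_names)),
        key=lambda perm: sum(sim(c, r) for c, r in zip(cand_names, perm)),
    )
    return {c: (r, sim(c, r)) for c, r in zip(cand_names, best)}


def caption_score(matches):
    """Caption-level score: the lowest per-object similarity (assumption of this sketch)."""
    return min((s for _, s in matches.values()), default=1.0)


# Toy vectors (hypothetical): 'wolf' lies close to 'dog', 'potato' does not.
reference = {"dog": [1.0, 0.0, 0.0], "frisbee": [0.0, 1.0, 0.0]}
caption_wolf = {"wolf": [0.9, 0.1, 0.0], "frisbee": [0.0, 1.0, 0.0]}
caption_potato = {"potato": [0.1, 0.0, 1.0], "frisbee": [0.0, 1.0, 0.0]}

matches_wolf = match_objects(caption_wolf, reference)
matches_potato = match_objects(caption_potato, reference)
score_wolf = caption_score(matches_wolf)
score_potato = caption_score(matches_potato)

print(matches_wolf["wolf"][0], round(score_wolf, 3))
print(matches_potato["potato"][0], round(score_potato, 3))
```

In this toy setup both 'wolf' and 'potato' are matched to 'dog', but 'wolf' keeps a high similarity while 'potato' scores near zero, so the per-object scores also indicate which object in the caption is the likely hallucination.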
Evaluation Highlights
  • Correctly identifies 13.6% more hallucinated objects than the CHAIR metric on the HAT dataset.
  • Correctly identifies 30.8% more hallucinated objects than CHAIR on the nocaps-FOIL dataset (out-of-domain objects).
  • Outperforms CLIPScore by 8.5% in Average Precision (AP) for hallucination detection on HAT.
Breakthrough Assessment
7/10
Significant improvement in evaluating open-vocabulary captions, addressing a major limitation of the standard CHAIR metric. The introduction of the HAT dataset is also a valuable contribution.