ALOHa: A New Measure for Hallucination in Captioning Models

📝 Paper Summary

Hallucination Detection Image Captioning Evaluation

ALOHa is an open-vocabulary metric for image captioning that uses LLMs and semantic embeddings to detect and localize object hallucinations more accurately than fixed-list matching methods.

Core Problem

Existing hallucination metrics like CHAIR rely on exact string matching against a fixed set of MS COCO objects, failing to generalize to open-vocabulary settings or ambiguous object descriptions.

Why it matters:

State-of-the-art vision-language models still hallucinate objects not present in the scene, undermining reliability.
Current metrics cannot handle synonyms, attributes, or objects outside the training distribution (e.g., 'wolf' vs. 'dog').
Rigid string matching penalizes valid but specific descriptions or fails to detect hallucinations in novel domains.

Concrete Example: If a model captions an image with 'a purple shirt' while the reference says 'a white shirt', CHAIR may miss the attribute mismatch because the noun 'shirt' matches, or it may penalize the caption incorrectly. Conversely, if a model predicts 'wolf' where the reference says 'dog', CHAIR can only make a binary error decision, whereas ALOHa uses semantic similarity to penalize 'wolf' less than an unrelated object such as 'potato'.

Key Novelty

Assessment with Language models for Object Hallucination (ALOHa)

Uses an LLM for zero-shot in-context learning to extract groundable objects and attributes from captions, replacing rigid parser-based extraction.
Computes a 'hallucination score' using the Hungarian matching algorithm on semantic embeddings (S-BERT) between candidate and reference objects.
Handles uncertainty (e.g., 'possibly a Frisbee') and complex noun phrases by filtering uncertain objects and matching based on semantic distance rather than binary presence.

Evaluation Highlights

+13.6 percentage points in Localization Accuracy (30.3 to 43.9) for identifying hallucinated objects on the HAT dataset compared to the CHAIR metric.
+30.8 percentage points in Localization Accuracy (0.0 to 30.8) for identifying hallucinations on the nocaps-FOIL dataset (out-of-domain objects) compared to CHAIR.
Outperforms CLIPScore by 8.5% in Average Precision (AP) for hallucination detection on HAT.

Breakthrough Assessment

7/10

Significant improvement in evaluating open-vocabulary captions, addressing a major limitation of the standard CHAIR metric. The introduction of the HAT dataset is also a valuable contribution.

⚙️ Technical Details

Problem Definition

Setting: Reference-based evaluation of image captioning models for object hallucination

Inputs: Candidate caption C, set of reference captions R, and image I

Outputs: Object-level hallucination scores (ALOHa_o) and a caption-level hallucination score (ALOHa)

Pipeline Flow

Object Extraction (LLM + DETR)
Object Filtering (Parsing & Uncertainty Handling)
Object Matching (Semantic Embedding + Hungarian Algorithm)

System Modules

Object Extractor

Extract visual objects (noun phrases) from candidate and reference captions; detect objects in image

Model or implementation: ChatGPT (text extraction) + DETR (image detection)
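
The text-extraction half of this module can be pictured as prompt construction followed by response parsing. The following is a minimal sketch only: the prompt wording, the '- item' response format, and the function names are assumptions made for illustration, since the actual ALOHa prompt is not reproduced in this summary. A canned string stands in for the model reply, so no API call is made.

```python
# Hypothetical prompt template; the real ALOHa prompt is not given in this summary.
PROMPT_TEMPLATE = (
    "List every visually groundable object (with its attributes) mentioned in the "
    "caption below, one per line, prefixed with '- '.\n"
    "Caption: {caption}\n"
    "Objects:"
)


def build_extraction_prompt(caption: str) -> str:
    """Fill the (assumed) template with a single caption."""
    return PROMPT_TEMPLATE.format(caption=caption.strip())


def parse_object_list(llm_response: str) -> list[str]:
    """Parse an assumed '- item' per-line reply into lowercase noun phrases."""
    objects = []
    for line in llm_response.splitlines():
        line = line.strip()
        if line.startswith("- "):
            phrase = line[2:].strip().lower()
            if phrase:
                objects.append(phrase)
    return objects


prompt = build_extraction_prompt("A dog chasing a purple frisbee in a park.")
# Canned reply standing in for the LLM so the sketch runs offline.
fake_response = "- Dog\n- purple frisbee\n- park\n"
extracted = parse_object_list(fake_response)
print(extracted)
```

Keeping the parser separate from the prompt makes it easy to swap in a different model or response format without touching the rest of the pipeline.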

Object Filter

Refine object sets: handle conjunctions, remove uncertain objects (e.g., 'possibly'), and normalize reference objects

Model or implementation: spaCy (for root noun extraction)
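
As a rough illustration of the filtering step, the sketch below drops uncertain phrases and splits conjunctions. It is not the ALOHa implementation: the list of uncertainty cues is an assumption (the summary only gives 'possibly'), the split on ' and ' is deliberately naive, and taking the last word of a phrase is a crude stand-in for spaCy's root-noun extraction.

```python
# Assumed uncertainty cues; the summary only names 'possibly' as an example.
UNCERTAINTY_MARKERS = ("possibly", "maybe", "perhaps")


def is_uncertain(phrase: str) -> bool:
    """True if the phrase contains one of the assumed hedging cues."""
    words = phrase.lower().split()
    return any(marker in words for marker in UNCERTAINTY_MARKERS)


def split_conjunction(phrase: str) -> list[str]:
    """Naively split 'a fork and a knife' into ['a fork', 'a knife']."""
    return [part.strip() for part in phrase.split(" and ") if part.strip()]


def naive_root_noun(phrase: str) -> str:
    """Stand-in for spaCy root-noun extraction: take the last word of the phrase."""
    return phrase.split()[-1]


def filter_objects(phrases: list[str]) -> list[str]:
    """Drop uncertain phrases, then split conjunctions into separate objects."""
    kept = []
    for phrase in phrases:
        if is_uncertain(phrase):
            continue
        kept.extend(split_conjunction(phrase.lower()))
    return kept


candidates = ["a dog", "possibly a frisbee", "a fork and a knife"]
filtered = filter_objects(candidates)
roots = [naive_root_noun(p) for p in filtered]
print(filtered, roots)
```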

Semantic Matcher (Object Matching)

Compute semantic similarity between filtered candidate and reference objects

Model or implementation: S-BERT (Sentence-BERT)
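
To make the similarity computation concrete, here is a small sketch that builds a candidate-by-reference cosine-similarity matrix. The three-dimensional vectors are invented toy values standing in for real S-BERT embeddings, chosen so that 'wolf' lands near 'dog' and 'potato' does not; the function names are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for S-BERT embeddings (assumed values, for illustration only).
TOY_EMBEDDINGS = {
    "dog": [1.0, 0.0, 0.0],
    "wolf": [0.8, 0.6, 0.0],
    "potato": [0.0, 0.0, 1.0],
}


def similarity_matrix(candidates, references, embed):
    """Rows are candidate objects, columns are reference objects."""
    return [
        [cosine_similarity(embed[c], embed[r]) for r in references]
        for c in candidates
    ]


sim = similarity_matrix(["wolf", "potato"], ["dog"], TOY_EMBEDDINGS)
print(sim)
```

With these toy values 'wolf' scores far closer to 'dog' than 'potato' does, which is the graded behavior that binary string matching cannot express.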

Scorer (Object Matching)

Calculate final hallucination scores using optimal assignment

Model or implementation: Hungarian matching algorithm
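
The scoring step can be sketched as an optimal assignment over a similarity matrix. Two caveats apply. First, this sketch finds the optimum by exhaustive search over permutations, which reaches the same total as the Hungarian algorithm but only scales to tiny inputs; a practical version would likely call a library solver such as scipy.optimize.linear_sum_assignment. Second, taking the caption-level score as the minimum object-level score, scoring unmatched candidates as 0.0, and returning 1.0 for a caption with no objects are assumptions, since this summary does not state the aggregation rule. In this sketch a higher score means the object is better supported by the references.

```python
from itertools import permutations


def best_assignment(sim):
    """Return, for each candidate row, the column index of its matched reference.

    sim[i][j] is the similarity between candidate i and reference j. Exhaustive
    search is used here for clarity; it is factorial-time and only for tiny inputs.
    """
    n_cand = len(sim)
    n_ref = len(sim[0]) if sim else 0
    # Pad with zero-similarity dummy references so every candidate gets a column.
    width = max(n_cand, n_ref)
    padded = [list(row) + [0.0] * (width - n_ref) for row in sim]
    best_total, best_perm = float("-inf"), ()
    for perm in permutations(range(width), n_cand):
        total = sum(padded[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm)


def aloha_style_scores(sim):
    """Object-level scores plus an (assumed) min-aggregated caption-level score."""
    n_ref = len(sim[0]) if sim else 0
    assignment = best_assignment(sim)
    # A candidate matched to a padded dummy column has no real reference: score 0.0.
    object_scores = [
        sim[i][j] if j < n_ref else 0.0 for i, j in enumerate(assignment)
    ]
    caption_score = min(object_scores) if object_scores else 1.0
    return object_scores, caption_score


# Rows: candidates ['dog', 'frisbee']; columns: references ['dog', 'ball', 'park'].
toy_sim = [
    [1.0, 0.2, 0.1],
    [0.1, 0.6, 0.2],
]
object_scores, caption_score = aloha_style_scores(toy_sim)
print(object_scores, caption_score)
```

Because each reference can be matched at most once, two candidate objects cannot both claim the same reference as support, which is what separates assignment from a simple per-object nearest-neighbor lookup.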

Novel Architectural Elements

Integration of LLM-based zero-shot object extraction with semantic embedding matching in a single evaluation metric
Formulation of hallucination detection as a continuous linear assignment problem (Hungarian matching) rather than binary string matching

Modeling

Base Model: ChatGPT (for extraction), S-BERT (for embedding)

Comparison to Prior Work

vs. CHAIR: ALOHa uses open-vocabulary extraction (LLM) and semantic matching (S-BERT) vs. fixed-list string matching
vs. CLIPScore: ALOHa provides localizable object-level scores vs. global sentence score; ALOHa is reference-based
vs. POPE: ALOHa evaluates a single generated caption for hallucinations vs. POPE which evaluates model propensity via VQA-style probing queries
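
The contrast with fixed-list matching can be illustrated with a deliberately simplified stand-in for a CHAIR-style check. This is not the CHAIR implementation: the four-word vocabulary and the function name are hypothetical, and the real metric works from the MS COCO object classes. The sketch only shows why an out-of-vocabulary word such as 'wolf' goes unnoticed.

```python
# Tiny hypothetical vocabulary; the real CHAIR list covers the MS COCO object classes.
FIXED_VOCAB = {"dog", "cat", "frisbee", "shirt"}


def fixed_list_hallucinations(caption, reference_objects):
    """Flag in-vocabulary caption words that are absent from the reference objects.

    Words outside FIXED_VOCAB are never checked, so they can never be flagged.
    """
    words = [w.strip(".,").lower() for w in caption.split()]
    mentioned = {w for w in words if w in FIXED_VOCAB}
    return sorted(mentioned - set(reference_objects))


# 'frisbee' is in the vocabulary and missing from the references, so it is flagged.
flagged_in_vocab = fixed_list_hallucinations("A wolf catching a frisbee.", {"dog"})
# 'wolf' is outside the vocabulary, so the mismatch with 'dog' is silently missed.
flagged_out_of_vocab = fixed_list_hallucinations("A wolf running.", {"dog"})
print(flagged_in_vocab, flagged_out_of_vocab)
```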

Limitations

Relies on closed-source LLMs (ChatGPT), leading to potential non-determinism and API costs.
Requires high-quality reference captions; underperforms reference-free methods when references are impoverished.
Slower and more expensive to compute than CHAIR due to LLM calls.
S-BERT is optimized for sentence similarity, potentially leading to inaccuracies for single-word object comparisons.

Reproducibility

Code: https://davidmchan.github.io/aloha

The code and the HAT dataset are publicly available. The pipeline relies on a closed-source LLM (ChatGPT) for parsing, which introduces non-determinism and API costs; the experiments cost ~$120 USD.

📊 Experiments & Results

Evaluation Setup

Evaluation of hallucination detection metrics on expert-annotated and synthetic datasets

Benchmarks:

HAT (Hallucination Detection & Localization) [New]
nocaps-FOIL (Synthetic Hallucination Detection, Out-of-Domain) [New]
FOIL (Synthetic Hallucination Detection, In-Domain COCO)

Metrics:

Average Precision (AP)
Localization Accuracy (LA)

Statistical methodology: Not explicitly reported in the paper

Key Results

Performance on the new HAT dataset (Gold Standard). ALOHa outperforms the baseline in both detecting (AP) and localizing (LA) hallucinations.

Benchmark	Metric	Baseline	This Paper	Δ
HAT	Average Precision (AP)	69.1	80.9	+11.8
HAT	Localization Accuracy (LA)	30.3	43.9	+13.6

Performance on Synthetic Datasets (FOIL and nocaps-FOIL). ALOHa leads in the out-of-domain setting (nocaps-FOIL) but trails the baseline on in-domain COCO data (FOIL).

Benchmark	Metric	Baseline	This Paper	Δ
nocaps-FOIL	Localization Accuracy (LA)	0.0	30.8	+30.8
FOIL (COCO)	Average Precision (AP)	95.5	81.6	-13.9

Main Takeaways

ALOHa generalizes better to out-of-domain data (nocaps) where fixed-list metrics like CHAIR fail completely.
Combining LLMs for extraction with S-BERT for embedding outperforms purely parser-based (spaCy) or word-embedding (Word2Vec) approaches.
ALOHa is robust to missing image detections, maintaining high performance even without DETR augmentation (AP 80.9 with DETR vs. 78.4 without).
While CHAIR is superior on in-domain COCO data (FOIL) due to the dataset construction favoring fixed dictionaries, ALOHa is far more effective on realistic, open-vocabulary data (HAT).

📚 Prerequisite Knowledge

Prerequisites

Image Captioning
Hallucination in Vision-Language Models
Semantic Text Embeddings
Bipartite Matching

Key Terms

CHAIR: Captioning Hallucination Assessment with Image Relevance—a standard metric that detects hallucinations by string-matching objects against a fixed list (MS COCO classes)

S-BERT: Sentence-BERT, a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared with cosine similarity

Hungarian matching: An optimization algorithm that solves the assignment problem (finding the best pairing between two sets) in polynomial time

DETR: DEtection TRansformer—an end-to-end object detection model that uses transformers

HAT: HAllucination Test—a new gold-standard dataset introduced in this paper, annotated by experts for hallucinations in captions

nocaps: Novel Object Captioning at Scale—a benchmark dataset for image captioning involving objects not seen in the COCO training set

FOIL: A dataset where objects in captions are replaced with similar 'foil' objects to test hallucination detection

AP: Average Precision—a metric measuring the area under the precision-recall curve

LA: Localization Accuracy—the accuracy of correctly indicating exactly which object in a caption is hallucinated
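
The two metrics defined above can be worked through on toy data. In this sketch AP is computed as the mean of the precision values at each positive item when items are ranked by descending score, a common way of summarizing the precision-recall curve. Two conventions are assumptions made here rather than taken from the paper: a higher detection score means 'more likely hallucinated', and the object with the lowest object-level score in a caption is treated as the predicted hallucination. All numbers are invented.

```python
def average_precision(scores, labels):
    """Mean of precision-at-rank over the positive items, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_pos


def localization_accuracy(per_caption_scores, true_hallucinated):
    """Fraction of captions whose lowest-scoring object is the annotated hallucination."""
    correct = 0
    for object_scores, truth in zip(per_caption_scores, true_hallucinated):
        predicted = min(object_scores, key=object_scores.get)
        correct += int(predicted == truth)
    return correct / len(true_hallucinated)


# Four captions: detection score (assumed: higher = more likely hallucinated)
# and ground-truth label (1 = caption contains a hallucination).
ap = average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])

# Two captions with object-level scores (assumed: lower = less supported).
la = localization_accuracy(
    [{"dog": 0.9, "frisbee": 0.3}, {"cat": 0.2, "sofa": 0.8}],
    ["frisbee", "sofa"],
)
print(ap, la)
```

In the AP example the positives sit at ranks 1 and 3, giving precisions 1/1 and 2/3 and an AP of 5/6. In the LA example only the first caption is localized correctly, giving 0.5.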