Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Image Captioning, Hallucination Mitigation

CapMAS improves detailed image captions by using an LLM to decompose long descriptions into atomic claims, which an MLLM then verifies against the image to remove hallucinations.

Core Problem

Existing MLLMs hallucinate frequently when generating long, detailed captions because, as sequence length increases, they rely more on their own generated text than on the input image.

Why it matters:

Current hallucination detection methods (Confidence, Consistency) fail to detect errors that occur later in long sequences (after ~192 tokens)
Standard captioning metrics (BLEU, CIDEr) require reference captions, which are impractical to collect for hyper-detailed descriptions
High-stakes applications such as visual assistance for visually impaired users require descriptions that are both exhaustive (high coverage) and strictly factual

Concrete Example: In a long caption, an MLLM might correctly describe a room but then hallucinate 'a small red ball' at the very end. Standard methods checking token probabilities won't catch this because the model is confident in its own language flow, even though the ball isn't in the image.

Key Novelty

Caption factuality enhancing MultiAgent System (CapMAS)

Decomposition-Verification-Revision: Instead of correcting the whole text at once, an LLM breaks the caption into tiny 'True/False' statements (atomic propositions).
Context Isolation: An MLLM verifies each statement independently against the image, breaking the 'language prior' bias that causes hallucinations in long sequences.
Evaluation Framework: Introduces a new dual-metric approach measuring both 'Factuality' (using GPT-4o verification) and 'Coverage' (using a custom VQA dataset).
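
The two metrics can be illustrated with a minimal sketch. The summary does not give the exact formulas, so this assumes Factuality is the fraction of atomic propositions a judge marks true and Coverage is plain VQA accuracy when questions are answered from the caption alone. The function names and toy inputs are hypothetical.

```python
def factuality_score(judgments):
    """Fraction of atomic propositions judged true (assumed definition)."""
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


def coverage_score(predicted_answers, gold_answers):
    """VQA accuracy when each question is answered using only the caption."""
    if len(predicted_answers) != len(gold_answers):
        raise ValueError("predicted and gold answers must align one-to-one")
    if not gold_answers:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted_answers, gold_answers))
    return correct / len(gold_answers)


# Toy inputs: judge verdicts for four propositions, answers for three questions.
judgments = [True, True, False, True]
predicted = ["red", "two", "unknown"]
gold = ["red", "three", "wooden"]

factuality = factuality_score(judgments)
coverage = coverage_score(predicted, gold)
print(f"factuality={factuality:.2f} coverage={coverage:.2f}")
```

Reporting the two numbers side by side matters because either one can be inflated alone: a very short caption can be fully factual with poor coverage, and a long hallucinated one can answer many questions while asserting false details.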

Architecture

The CapMAS pipeline: Decomposition, Verification, and Revision.

Evaluation Highlights

The 'Isolation' method outperforms confidence-based and consistency-based baselines at detecting hallucinations in long captions (AUROC 0.771 vs. 0.641 and FPR95 0.621 vs. 0.830 on the IIW-400 subset, per the Key Results table)
Proposed evaluation metric aligns better with human judgment than existing metrics like CLAIR or ALOHa when testing against synthetic hallucination datasets (Object, Attribution, Relation)
Improves the factuality of captions generated by state-of-the-art models, including GPT-4V, without requiring any model training (plug-and-play)

Breakthrough Assessment

8/10

Identifies a critical failure mode in current hallucination detection (length bias) and proposes a robust, training-free multi-agent solution. The new evaluation framework for detailed captions is also a significant contribution.

⚙️ Technical Details

Problem Definition

Setting: Post-hoc correction of hyper-detailed image captions generated by MLLMs

Inputs: An input image and an initial detailed caption generated by an MLLM

Outputs: A revised caption that maintains coverage while removing non-factual information

Pipeline Flow

Decomposition Agent (LLM) breaks caption into atomic propositions
Verification Agent (MLLM) verifies each proposition against the image
Revision Agent (LLM) reconstructs the caption using only true propositions
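
The three steps above can be sketched as a small function that chains three agents. This is an illustrative outline, not the authors' implementation: the agent signatures and the toy stand-ins (`toy_decompose`, `toy_verify`, `toy_revise`) are hypothetical, and in practice each agent would wrap an LLM or MLLM call.

```python
from typing import Callable, List

# Hypothetical agent signatures; real agents would wrap LLM / MLLM calls.
Decomposer = Callable[[str], List[str]]     # caption -> atomic propositions
Verifier = Callable[[object, str], bool]    # (image, proposition) -> is it true?
Reviser = Callable[[str, List[str]], str]   # (caption, true propositions) -> caption


def correct_caption(image, caption: str, decompose: Decomposer,
                    verify: Verifier, revise: Reviser) -> str:
    """Run decomposition, per-proposition verification, then revision."""
    propositions = decompose(caption)
    # Each proposition is checked on its own; the verifier never sees the caption.
    true_props = [p for p in propositions if verify(image, p)]
    return revise(caption, true_props)


# Toy stand-ins so the sketch runs without any model.
def toy_decompose(caption: str) -> List[str]:
    return [s.strip() for s in caption.split(".") if s.strip()]


def toy_verify(image, proposition: str) -> bool:
    # The "image" is modeled as a set of statements known to be visible.
    return proposition in image


def toy_revise(caption: str, true_props: List[str]) -> str:
    return ". ".join(true_props) + "."


image_facts = {"A sofa faces the window", "The walls are white"}
caption = ("A sofa faces the window. The walls are white. "
           "A small red ball lies on the floor.")
revised = correct_caption(image_facts, caption,
                          toy_decompose, toy_verify, toy_revise)
print(revised)
```

Passing the agents in as callables keeps the flow model-agnostic, which matches the plug-and-play framing: any LLM or MLLM that fits the signature can be swapped in.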

System Modules

Decomposition Agent

Break down complex sentences into individual facts

Model or implementation: GPT-4 (or similar LLM)

Verification Agent

Check the truthfulness of each atomic proposition individually

Model or implementation: LLaVA-NeXT (or similar MLLM)

Revision Agent

Rewrite the caption to remove false information while keeping true details

Model or implementation: GPT-4 (or similar LLM)

Novel Architectural Elements

Isolation-based verification: Disconnecting specific claims from the long narrative context to bypass the model's tendency to hallucinate at the end of long sequences.
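
To make the isolation idea concrete, the sketch below contrasts a verification prompt that carries the full caption with one that carries only the proposition under test. The prompt wording and function names are assumptions for illustration; the summary does not reproduce the prompts used in the paper.

```python
def in_context_prompt(caption: str, proposition: str) -> str:
    """Prompt that exposes the verifier to the full caption (what isolation avoids)."""
    return (
        f"Caption: {caption}\n"
        f"Is the following statement about the image true? {proposition}\n"
        "Answer True or False."
    )


def isolated_prompt(proposition: str) -> str:
    """Prompt that contains only the single proposition to be verified."""
    return (
        f"Is the following statement about the image true? {proposition}\n"
        "Answer True or False."
    )


caption = ("A sofa faces the window. The walls are white. "
           "A small red ball lies on the floor.")
proposition = "A small red ball lies on the floor."

print(in_context_prompt(caption, proposition))
print("---")
print(isolated_prompt(proposition))
```

In the isolated variant the verifier has no preceding narrative to lean on, so its answer has to come from the image rather than from the flow of the caption.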

Modeling

Base Model: Evaluated using LLaVA-NeXT and GPT-4V as the caption generators; uses GPT-4 for decomposition/revision.

Compute: Not reported in the paper

Comparison to Prior Work

vs. VCD/OPERA: CapMAS is a post-hoc correction method, not a decoding intervention, and works better for long sequences where decoding methods fail.
vs. Woodpecker: CapMAS does not pre-define hallucination types and uses 'atomic propositions' rather than targeted object detection, allowing it to catch subtler errors.
vs. RAG-based correction [not cited in paper]: Unlike retrieval methods that might look up external knowledge, CapMAS uses the image itself as the ground truth for verification.

Limitations

Reliance on the verification MLLM's capabilities; if the verifier fails to perceive an object that is actually present, it may incorrectly flag a true detail as false.
Computational cost is higher than single-pass generation due to the multi-step decompose-verify-revise process.
Does not explicitly address hallucinations related to text rendering (OCR) or complex spatial reasoning if the base MLLM is weak in those areas.

Reproducibility

Code: https://github.com/adobe-research/CapMAS

Code and data are available in the repository above. The method is training-free and relies on prompting existing models (GPT-4, LLaVA-NeXT).

📊 Experiments & Results

Evaluation Setup

Evaluating hallucination detection on long captions and evaluating the quality of corrected captions.

Benchmarks:

DOCCI (modified) (Detailed Image Captioning Factuality Evaluation)
IIW-400 (subset) (Hallucination Detection / VQA-based Coverage)

Metrics:

Factuality Score (GPT-4o based)
Coverage Score (VQA accuracy on generated captions)
AUROC (for hallucination detection)
Statistical methodology: Not explicitly reported in the paper
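
For readers unfamiliar with the detection metrics, here is a minimal pure-Python sketch of AUROC and of FPR95 (the false positive rate at the threshold that recovers 95% of positives, which appears in the Key Results table). It assumes label 1 marks a hallucinated item and that higher scores mean "more likely hallucinated"; the toy scores are made up and are not from the paper.

```python
import math


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fpr_at_tpr(scores, labels, target_tpr=0.95):
    """False positive rate at the highest threshold that reaches the target TPR."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Number of top-scoring positives that must be flagged to hit the target TPR.
    k = math.ceil(target_tpr * len(pos))
    threshold = pos[k - 1]
    return sum(n >= threshold for n in neg) / len(neg)


# Toy data: four hallucinated (1) and four factual (0) items.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

detector_auroc = auroc(scores, labels)
detector_fpr95 = fpr_at_tpr(scores, labels)
print(detector_auroc, detector_fpr95)
```

Higher AUROC is better while lower FPR95 is better, which is why the two Δ values in the results table have opposite signs.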

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Hallucination detection performance comparisons showing the proposed Isolation method outperforms baselines.
IIW-400 subset	AUROC	0.641	0.771	+0.130
IIW-400 subset	FPR95	0.830	0.621	-0.209
Evaluation of caption quality metrics showing the proposed metric is more reliable than existing ones.
DOCCI (Object Hallucination subset)	Score Gap (Clean - Object)	0.007	0.194	+0.187

Experiment Figures

Graphs of hallucination scores vs. token position index.

Main Takeaways

Existing hallucination detection methods (Confidence, Consistency) fail dramatically as caption length increases (specifically after ~192 tokens).
Standard metrics (CLIPScore, BLEU, etc.) are unreliable for measuring factuality in detailed captions; they often score hallucinated captions similarly to factual ones.
The 'Isolation' strategy—verifying facts without context—is crucial for correcting long captions because it removes the model's bias towards its own generated text.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with the concept of Hallucination in LLMs
Basic knowledge of Visual Question Answering (VQA)

Key Terms

MLLM: Multimodal Large Language Model—an AI that can process both text and images to generate responses

Hallucination: When a model generates plausible-sounding but factually incorrect information (e.g., describing an object not present in the image)

Atomic Proposition: A simple, indivisible statement that can be clearly judged as either True or False (e.g., 'The cat is black')

VQA: Visual Question Answering—a task where a model answers questions about the content of an image

Greedy Decoding: A generation strategy where the model always picks the single most likely next token

Stochastic Decoding: A generation strategy where the model samples the next token according to its predicted probabilities, introducing randomness

AUROC: Area Under the Receiver Operating Characteristic curve, a performance metric for classification tasks; 1.0 is perfect, 0.5 is random guessing

BLEU/CIDEr/METEOR: Standard metrics for evaluating text generation by matching words against human-written references

CLIPScore: A metric measuring how well an image and a caption match using the CLIP model's embedding space
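
As a worked illustration of the last term, the sketch below computes CLIPScore from two embedding vectors using its standard definition, w · max(cos(image, caption), 0) with w = 2.5. The two-dimensional vectors are toy values standing in for real CLIP embeddings, which would come from a CLIP image encoder and text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """Rescaled cosine similarity, clipped at zero."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


# Toy embeddings (hypothetical values, not real CLIP outputs).
image_emb = [3.0, 4.0]
caption_emb = [4.0, 3.0]
opposite_emb = [-3.0, -4.0]

aligned = clip_score(image_emb, caption_emb)
clipped = clip_score(image_emb, opposite_emb)
print(round(aligned, 3), clipped)
```

Because the score is a single similarity over the whole caption, there is no term that singles out one incorrect detail, which is consistent with the takeaway that such metrics often score hallucinated and factual detailed captions similarly.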