
Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision

Seongyun Lee, Sue Hyun Park, Yongrae Jo, Minjoon Seo
Korea Advanced Institute of Science and Technology, LG AI Research
North American Chapter of the Association for Computational Linguistics (2023)
MM Factuality Reasoning

📝 Paper Summary

Multimodal Hallucination Mitigation, Self-Correction in LMMs
Volcano is a single multimodal model that mitigates hallucinations by generating its own natural language feedback based on visual inputs and iteratively revising its responses without external reward models.
Core Problem
Large Multimodal Models (LMMs) frequently hallucinate by relying on language priors rather than grounding accurately on visual features, leading to responses misaligned with the image.
Why it matters:
  • Hallucinations undermine trust in multimodal systems because the fabricated details (e.g., objects not present in the scene) can be checked against the image and shown to be false
  • Existing revision methods often require separate specialized models or reinforcement learning pipelines, which are complex and resource-intensive
  • Vision encoders often fail to produce precise features, causing the language model to guess based on parametric knowledge instead of visual evidence
Concrete Example: Asked about an image, a standard LMM might describe an object that is not there because that object commonly appears in similar contexts in text data. Volcano would generate feedback such as 'The object X is not visible in the bottom left' and then revise the description to remove the hallucinated object.
Key Novelty
Self-Feedback Guided Revision Loop
  • Incorporates a critique-revise-decide loop within a single model: it generates an initial response, produces natural language feedback about that response's visual alignment, and then revises it.
  • Uses a 'decision' step to compare the revised response against the original to prevent degradation, rather than blindly accepting revisions.
  • Training data is synthesized by a proprietary LLM (GPT-3.5) using text-based proxies (captions/object lists) for images, allowing the open-source model to learn feedback generation.
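The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SelfReviser` container, its four callables, the `max_rounds` cap, the early stop when a revision is rejected, and the toy model (which treats an "image" as a set of visible object names) are all hypothetical names and choices introduced here for clarity. In Volcano itself a single LMM would play all four roles.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SelfReviser:
    """Hypothetical wrapper: the four roles a single model plays in the loop."""
    answer: Callable[[Any, str], str]              # (image, question) -> initial response
    critique: Callable[[Any, str, str], str]       # (image, question, response) -> feedback
    revise: Callable[[Any, str, str, str], str]    # (image, question, response, feedback) -> revision
    # Assumption: the decision step sees only the question and the two candidates.
    prefers_revision: Callable[[str, str, str], bool]


def self_feedback_revision(model: SelfReviser, image: Any, question: str,
                           max_rounds: int = 3) -> str:
    """Critique, revise, then decide; keep the current response if the revision is not preferred."""
    current = model.answer(image, question)
    for _ in range(max_rounds):  # max_rounds is an assumed cap, not taken from the summary
        feedback = model.critique(image, question, current)
        candidate = model.revise(image, question, current, feedback)
        if not model.prefers_revision(question, current, candidate):
            break  # assumed stopping rule: stop once a revision is rejected
        current = candidate
    return current


def toy_model() -> SelfReviser:
    """Deterministic stand-in for an LMM; the 'image' is a set of visible object names."""
    def answer(image, question):
        return "dog, frisbee, grass"  # 'frisbee' plays the hallucinated object

    def critique(image, question, response):
        missing = [w for w in response.split(", ") if w not in image]
        return "; ".join(f"{w} is not visible" for w in missing)

    def revise(image, question, response, feedback):
        flagged = {part.split(" ")[0] for part in feedback.split("; ") if part}
        return ", ".join(w for w in response.split(", ") if w not in flagged)

    def prefers_revision(question, current, revised):
        return revised != current  # toy rule: accept only if something changed

    return SelfReviser(answer, critique, revise, prefers_revision)


if __name__ == "__main__":
    print(self_feedback_revision(toy_model(), {"dog", "grass"}, "What is in the image?"))
```

With the toy model, the first round flags and removes the object that is absent from the image, and the second round produces no change, so the decision step rejects it and the loop stops. The decision step is what keeps a poor revision from replacing a better earlier response.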
Evaluation Highlights
  • Achieves a 24.9% improvement on multimodal hallucination benchmarks over previous specialized mitigation methods (LURE, Woodpecker, LLaVA-RLHF)
  • Volcano-13B scores approximately twice as high as the baseline LLaVA-1.5 13B on math-related visual tasks, demonstrating improved reasoning capabilities
  • Qualitative analysis shows generated feedback attends to images with higher intensity and coverage than initial responses, confirming the feedback mechanism grounds the model better
Breakthrough Assessment
7/10
Strong empirical results and a clean, single-model approach to self-correction. While the core idea of self-reflection is established in LLMs, applying it effectively to LMMs with a full feedback-revise-decide loop is a valuable contribution.