Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision

📝 Paper Summary

Multimodal Hallucination Mitigation, Self-Correction in LMMs

Volcano is a single multimodal model that mitigates hallucinations by generating its own natural language feedback based on visual inputs and iteratively revising its responses without external reward models.

Core Problem

Large Multimodal Models (LMMs) frequently hallucinate by relying on language priors rather than grounding accurately on visual features, leading to responses misaligned with the image.

Why it matters:

Hallucinations in multimodal systems undermine trust by fabricating verifiable visual details (e.g., describing objects not present in the scene)
Existing revision methods often require separate specialized models or reinforcement learning pipelines, which are complex and resource-intensive
Vision encoders often fail to produce precise features, causing the language model to guess based on parametric knowledge instead of visual evidence

Concrete Example: When asked about an image, a standard LMM might describe an object that isn't there because it commonly appears in similar contexts in text data. Volcano would generate feedback like 'The object X is not visible in the bottom left,' and then revise the description to remove the hallucinated object.

Key Novelty

Self-Feedback Guided Revision Loop

Incorporates a critique-revise-decide loop within a single model: it generates an initial response, produces natural language feedback about that response's visual alignment, and then revises it.
Uses a 'decision' step to compare the revised response against the original to prevent degradation, rather than blindly accepting revisions.
Training data is synthesized by a proprietary LLM (GPT-3.5) using text-based proxies (captions/object lists) for images, allowing the open-source model to learn feedback generation.

Evaluation Highlights

Achieves a 24.9% improvement on multimodal hallucination benchmarks over previous specialized mitigation methods (LURE, Woodpecker, LLaVA-RLHF)
Volcano-13B scores approximately twice as high as the baseline LLaVA-1.5 13B on math-related visual tasks, demonstrating improved reasoning capabilities
Qualitative analysis shows that, while generating feedback, the model attends to the image with higher intensity and wider coverage than while generating the initial response, suggesting that the feedback step improves visual grounding

Breakthrough Assessment

7/10

Strong empirical results and a clean, single-model approach to self-correction. While the core idea of self-reflection is established in LLMs, applying it effectively to LMMs with a full feedback-revise-decide loop is a valuable contribution.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering (VQA) and multimodal dialogue with a focus on hallucination reduction

Inputs: Image I and Question Q

Outputs: Final revised response R_best

Pipeline Flow

Generation: Initial Response
Critique: Feedback Generation
Correction: Revision
Evaluation: Decision Making
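
The four steps above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the `MultimodalModel` interface, the bracketed prompt tags, the `max_iterations` default, and the early stop when the decision favors the current response are all assumptions made for the example.

```python
from typing import Any, Protocol


class MultimodalModel(Protocol):
    """Hypothetical interface: one model that answers any text prompt about an image."""

    def generate(self, image: Any, prompt: str) -> str:
        ...


def self_feedback_revision(
    model: MultimodalModel,
    image: Any,
    question: str,
    max_iterations: int = 3,  # assumed default, not taken from the paper
) -> str:
    # Generation: draft an initial response from the image and question.
    best = model.generate(image, question)

    for _ in range(max_iterations):
        # Critique: ask the same model for natural language feedback.
        feedback = model.generate(
            image,
            f"[FEEDBACK] Question: {question}\nResponse: {best}\n"
            "Point out where the response disagrees with the image.",
        )
        # Correction: rewrite the response using that feedback.
        revised = model.generate(
            image,
            f"[REVISE] Question: {question}\nResponse: {best}\n"
            f"Feedback: {feedback}\nRewrite the response using the feedback.",
        )
        # Evaluation: let the model choose between the current best (A)
        # and the revision (B) instead of accepting the revision blindly.
        verdict = model.generate(
            image,
            f"[DECIDE] Question: {question}\nA: {best}\nB: {revised}\n"
            "Which response matches the image better? Answer A or B.",
        )
        if not verdict.strip().upper().startswith("B"):
            break  # assumption: stop once a revision no longer wins
        best = revised

    return best
```

Any object with a compatible `generate` method can be passed as `model`; the same instance plays all four roles, which mirrors the single-model design described in this summary.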

System Modules

Initial Generator

Generate the first draft response based on the image and question

Model or implementation: Volcano (LLaVA-1.5 architecture)

Feedback Generator

Critique the current best response and provide visual cues/corrections

Model or implementation: Volcano (Same model)

Reviser

Generate a new response incorporating the feedback

Model or implementation: Volcano (Same model)

Decision Maker

Compare R_best and R_revised to select the superior response

Model or implementation: Volcano (Same model)

Novel Architectural Elements

Unified single-model architecture for generation, feedback, revision, and decision (unlike prior works using separate corrector models)
Iterative critique-revise-decide inference loop applied to multimodal inputs

Modeling

Base Model: LLaVA-1.5 (7B and 13B variants)

Training Method: Supervised Fine-Tuning (SFT) on constructed feedback/revision data

Training Data:

Source: LLaVA-SFT-127k dataset
Visual Instruction Data: llava-1.5-mix665k
Feedback/Revision Synthesis: Uses GPT-3.5-turbo with text-based image proxies (captions + object details) to generate gold feedback and revisions
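
To make the text-proxy idea concrete, the sketch below assembles a prompt for a text-only LLM from captions and an object list. It is illustrative only: the `ImageProxy` structure, the `build_feedback_prompt` function, and the prompt wording are hypothetical, and the actual prompts used by the authors are given in the paper's appendix, not here. No API call is made; the function only builds the string.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImageProxy:
    """Text stand-in for an image (hypothetical structure for this example)."""

    captions: List[str]
    objects: List[str] = field(default_factory=list)


def build_feedback_prompt(proxy: ImageProxy, question: str, initial_response: str) -> str:
    """Build a prompt asking a text-only LLM for feedback and a revision."""
    caption_block = "\n".join(f"- {c}" for c in proxy.captions)
    object_block = ", ".join(proxy.objects) if proxy.objects else "(none listed)"
    # The wording below is an assumption, not the prompt from the paper.
    return (
        "You cannot see the image. Use only the description below.\n"
        f"Captions:\n{caption_block}\n"
        f"Objects: {object_block}\n"
        f"Question: {question}\n"
        f"Initial response: {initial_response}\n"
        "Write feedback on where the response conflicts with the description, "
        "then write a revised response."
    )
```

Because the LLM only sees captions and object names, any visual detail missing from those proxies cannot appear in the synthesized feedback.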

Key Hyperparameters:

backbone_models: LLaVA-1.5 7B & 13B

Compute: Not reported in the paper

Comparison to Prior Work

vs. LURE: Volcano uses a single model for self-revision rather than a separate corrector model
vs. Woodpecker: Volcano integrates visual feedback directly rather than relying on converting visuals to text for a blind LLM corrector
vs. LLaVA-RLHF: Volcano uses natural language feedback for correction rather than scalar reward signals

Limitations

Inference latency increases due to the iterative nature of the critique-revise-decide loop
Relies on the quality of synthetic feedback data generated by GPT-3.5 during training
The proprietary LLM used for data generation (GPT-3.5) cannot see images directly, relying on text proxies (captions/objects) which may lose visual nuance

Reproducibility

Code: https://github.com/kaistAI/Volcano

Code, data, and models (7B and 13B) are publicly released. The paper details the prompts used for data collection in Appendix B and the hyperparameters in Appendix D.

📊 Experiments & Results

Evaluation Setup

Multimodal Hallucination and General Multimodal Understanding

Benchmarks:

POPE: object hallucination (yes/no polling)
GAVIE: hallucination and instruction following (GPT-4 evaluated)
MMHal-Bench: open-ended hallucination evaluation
MM-Vet: complex multimodal tasks
MMBench: visual perception and reasoning

Metrics:

Accuracy
F1 Score
Relevancy
GPT-4 Score (MMHal-Bench)
Statistical methodology: Not explicitly reported in the paper
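
For yes/no benchmarks such as POPE, accuracy and F1 can be computed from paired answers as in the sketch below. It assumes "yes" is the positive class and that a prediction counts as "yes" when it starts with that word; the function name `pope_metrics` and the answer-parsing rule are choices made for this example, not taken from the benchmark's official scripts.

```python
from typing import Dict, Sequence


def pope_metrics(predictions: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for yes/no answers ('yes' = positive)."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")

    tp = fp = tn = fn = 0
    for pred, label in zip(predictions, labels):
        pred_yes = pred.strip().lower().startswith("yes")  # assumed parsing rule
        label_yes = label.strip().lower() == "yes"
        if pred_yes and label_yes:
            tp += 1
        elif pred_yes and not label_yes:
            fp += 1  # model claims an absent object exists (hallucination)
        elif not pred_yes and label_yes:
            fn += 1
        else:
            tn += 1

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

In this framing, a false positive corresponds to an object hallucination, so precision is the component most directly tied to the failure mode discussed in this summary.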

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Aggregated Hallucination Benchmarks	Performance Enhancement	Not reported in the paper	Not reported in the paper	+24.9%
Multimodal Understanding (Math)	Math Score	Not reported in the paper	Not reported in the paper	~2x (Volcano-13B vs. LLaVA-1.5 13B)

Main Takeaways

Volcano consistently outperforms LLaVA-1.5 and baseline correctors (LURE, Woodpecker) across hallucination benchmarks, indicating that self-feedback is more effective than external revision models.
The 'decision' step that closes each iteration is crucial; skipping it and always accepting the revised response leads to lower performance, suggesting the model is better at discriminating quality than generating it.
Providing natural language feedback is more effective than scalar rewards (RLHF) or direct revision without feedback (LURE), as it carries fine-grained visual information to guide the correction.
Performance improves with the number of allowed iterations, though this introduces a trade-off with inference latency.
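
The latency trade-off can be made concrete by counting generation calls. The sketch below assumes one call for the initial response and one call each for feedback, revision, and decision per iteration; the function name is hypothetical and the count is an upper bound, since a loop that stops early uses fewer calls.

```python
def max_generation_calls(max_iterations: int) -> int:
    """Upper bound on model generation calls for one question."""
    if max_iterations < 0:
        raise ValueError("max_iterations must be non-negative")
    initial_calls = 1        # first draft response
    calls_per_iteration = 3  # feedback + revision + decision
    return initial_calls + calls_per_iteration * max_iterations
```

Under these assumptions, allowing three iterations costs up to ten generation calls per question, compared with one call for a model that answers directly.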

📚 Prerequisite Knowledge

Prerequisites

Large Multimodal Model (LMM) architecture (e.g., a CLIP vision encoder paired with an LLM)
Concept of hallucination in generative models
Instruction tuning and Supervised Fine-Tuning (SFT)

Key Terms

Multimodal Hallucination: When a model generates text descriptions (e.g., objects, actions) that contradict or are unsupported by the provided visual input

Grounding: The process of linking abstract concepts in text to specific pixels or regions in an image

LMM: Large Multimodal Model, a system combining a vision encoder and a large language model to process image-text inputs

Self-revision: A process where a model critiques and corrects its own output without external human intervention during inference

POPE: Polling-based Object Probing Evaluation, a benchmark that asks yes/no questions about the existence of objects in an image to test for hallucinations

Parametric knowledge: Information stored in the model's weights during pre-training, as opposed to information derived from the current input context