← Back to Paper List

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Y. Yacoob, Lijuan Wang
University of Maryland, College Park, Microsoft Corporation
International Conference on Learning Representations (2023)
MM Factuality Benchmark

📝 Paper Summary

Visual Instruction Tuning Hallucination Mitigation in LMMs
The paper mitigates multi-modal hallucination by fine-tuning models on LRV-Instruction, a dataset balancing positive and negative visual instructions, and evaluates performance via a GPT-4-based reference-free metric.
Core Problem
Current Large Multi-Modal Models (LMMs) frequently hallucinate inconsistent descriptions and tend to answer 'Yes' to existence questions regardless of image content.
Why it matters:
  • LMMs inherit hallucination issues from Large Language Models (LLMs), leading to harmful consequences when users over-rely on them
  • Existing training data lacks diversity and primarily consists of positive instruction samples, causing models to over-rely on language priors rather than visual evidence
  • Evaluation metrics like CHAIR are unstable, and binary classification metrics require rigid templates, failing to capture open-ended hallucination
Concrete Example: When asked to describe an image of a room, MiniGPT4 describes a nonexistent 'dog' engaging in a nonexistent activity like 'playing with a ball' because these words statistically co-occur in language, ignoring the actual image content.
Key Novelty
LRV-Instruction Dataset & GAVIE Evaluation
  • Constructs a balanced dataset (LRV-Instruction) containing both positive instructions and three levels of negative instructions (Nonexistent Object, Existent Object, Knowledge Manipulation) to teach models to say 'No'
  • Proposes GAVIE (GPT4-Assisted Visual Instruction Evaluation), where GPT-4 acts as a 'smart teacher' to score model responses against dense captions without needing ground truth answers
Architecture
Architecture Figure Figure 3
Illustration of the prompt used to generate visual instructions via GPT-4.
Evaluation Highlights
  • Finetuned MiniGPT4 achieves 85.0 accuracy on POPE (Random split), surpassing the original MiniGPT4's 56.8 by +28.2 points
  • Finetuned mPLUG-Owl achieves 7.44 Relevancy score on the GAVIE benchmark, outperforming InstructBLIP (6.61) and LLaVA (5.26)
  • Including negative instructions (1:1 ratio) improves accuracy on negative samples from 48.0% to 77.1% compared to positive-only training
Breakthrough Assessment
7/10
Strong contribution in data-centric AI by proving that negative instruction samples are critical for robustness. The proposed evaluation metric is practical but relies on proprietary GPT-4.
×