Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

📝 Paper Summary

Visual Instruction Tuning
Hallucination Mitigation in LMMs

The paper mitigates multi-modal hallucination by fine-tuning models on LRV-Instruction, a dataset balancing positive and negative visual instructions, and evaluates performance via a GPT-4-based reference-free metric.

Core Problem

Current Large Multi-Modal Models (LMMs) frequently hallucinate descriptions that are inconsistent with the image, and they tend to answer 'Yes' to existence questions regardless of image content.

Why it matters:

LMMs inherit hallucination issues from Large Language Models (LLMs), leading to harmful consequences when users over-rely on them
Existing training data lacks diversity and primarily consists of positive instruction samples, causing models to over-rely on language priors rather than visual evidence
Evaluation metrics like CHAIR are unstable, and binary classification metrics require rigid templates, failing to capture open-ended hallucination

Concrete Example: When asked to describe an image of a room, MiniGPT4 describes a nonexistent 'dog' engaging in a nonexistent activity like 'playing with a ball' because these words statistically co-occur in language, ignoring the actual image content.

Key Novelty

LRV-Instruction Dataset & GAVIE Evaluation

Constructs a balanced dataset (LRV-Instruction) containing both positive instructions and three levels of negative instructions (Nonexistent Object Manipulation, Existent Object Manipulation, Knowledge Manipulation) to teach models to say 'No'
Proposes GAVIE (GPT4-Assisted Visual Instruction Evaluation), where GPT-4 acts as a 'smart teacher' to score model responses against dense captions without needing ground truth answers
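
A minimal sketch of how a GAVIE-style request could be assembled and its scores parsed. The prompt wording, the function names `build_gavie_prompt` and `parse_scores`, and the reply format are illustrative assumptions, not the authors' actual prompt; only the idea of scoring accuracy and relevancy on a 0-10 scale against dense captions, with no ground truth answer, comes from this summary. The call to GPT-4 itself is deliberately left out.

```python
import re

def build_gavie_prompt(dense_captions, instruction, response):
    """Assemble a hypothetical grading prompt from image annotations (no gold answer)."""
    context = "\n".join(f"- {c}" for c in dense_captions)
    return (
        "You are a teacher grading a student's answer about an image.\n"
        f"Image content (dense captions):\n{context}\n"
        f"Instruction: {instruction}\n"
        f"Student response: {response}\n"
        "Rate the response from 0 to 10 on two lines:\n"
        "Accuracy: <score>\nRelevancy: <score>"
    )

def parse_scores(reply):
    """Extract both scores from a reply in the assumed 'Name: number' format."""
    scores = {}
    for name in ("Accuracy", "Relevancy"):
        match = re.search(rf"{name}:\s*(\d+(?:\.\d+)?)", reply)
        if match is None:
            raise ValueError(f"missing {name} score in reply")
        scores[name.lower()] = float(match.group(1))
    return scores

# Hypothetical example data, not taken from the paper.
prompt = build_gavie_prompt(
    ["a wooden table", "a red chair"],
    "Is the dog playing with a ball?",
    "There is no dog in the image.",
)
scores = parse_scores("Accuracy: 8\nRelevancy: 9.5")
```

Because the evaluator sees only the image annotations and the response, the same template works for open-ended instructions that have no fixed reference answer.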

Architecture

Illustration of the prompt used to generate visual instructions via GPT-4.

Evaluation Highlights

Finetuned MiniGPT4 achieves 85.0% accuracy on POPE (Random split), a gain of 28.2 points over the original MiniGPT4's 56.8%
Finetuned mPLUG-Owl achieves 7.44 Relevancy score on the GAVIE benchmark, outperforming InstructBLIP (6.61) and LLaVA (5.26)
Including negative instructions (1:1 ratio) improves accuracy on negative samples from 48.0% to 77.1% compared to positive-only training

Breakthrough Assessment

7/10

Strong contribution in data-centric AI, showing that negative instruction samples are critical for robustness. The proposed evaluation metric is practical but relies on the proprietary GPT-4.

⚙️ Technical Details

Problem Definition

Setting: Visual Instruction Tuning where the model must generate text responses $Y$ given an image $I$ and instruction $X$, minimizing hallucination

Inputs: Image $I$, Natural Language Instruction $X$

Outputs: Natural Language Response $Y$

Pipeline Flow

Visual Encoder (extracts image features)
Alignment Module (Q-Former/Abstractor connects vision to text)
LLM Decoder (generates response based on aligned features + instruction)
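
The three stages above can be sketched at the level of tensor shapes. This is a toy illustration in pure Python, not the MiniGPT4 or mPLUG-Owl implementation: average pooling followed by a linear map stands in for the Q-Former or Visual Abstractor, the weights are random, and all sizes and function names are assumptions chosen for readability.

```python
import random

def linear(x, weight):
    """Multiply row vectors x (n x d_in) by weight (d_in x d_out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*weight)] for row in x]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def visual_encoder(image_patches, w_enc):
    # Stage 1: map raw patch vectors to visual features (stand-in for a ViT).
    return linear(image_patches, w_enc)

def alignment_module(visual_feats, w_proj, num_query_tokens):
    # Stage 2: compress to a fixed number of tokens by average pooling, then
    # project to the LLM embedding width (crude stand-in for Q-Former + linear).
    chunk = len(visual_feats) // num_query_tokens
    pooled = [
        [sum(col) / chunk for col in zip(*visual_feats[i * chunk:(i + 1) * chunk])]
        for i in range(num_query_tokens)
    ]
    return linear(pooled, w_proj)

def build_decoder_input(visual_tokens, instruction_embeddings):
    # Stage 3 input: visual soft prompts are prepended to the instruction embeddings.
    return visual_tokens + instruction_embeddings

rng = random.Random(0)
patches = random_matrix(16, 8, rng)   # 16 patches with 8 raw dims (toy sizes)
w_enc = random_matrix(8, 6, rng)      # visual feature width 6
w_proj = random_matrix(6, 4, rng)     # LLM embedding width 4
text = random_matrix(5, 4, rng)       # 5 instruction token embeddings

soft_prompts = alignment_module(visual_encoder(patches, w_enc), w_proj, num_query_tokens=4)
decoder_input = build_decoder_input(soft_prompts, text)
print(len(decoder_input), len(decoder_input[0]))  # 9 4
```

The point of the sketch is the interface: the LLM decoder never sees pixels, only a short sequence of projected visual tokens placed in front of the text embeddings.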

System Modules

Visual Encoder

Extract visual features from the input image

Model or implementation: Vision Transformer (ViT-G/14 from EVA-CLIP for MiniGPT4)

Alignment Module

Bridge the gap between visual features and LLM embedding space

Model or implementation: Q-Former + Linear Layer (MiniGPT4) or Visual Abstractor (mPLUG-Owl)

LLM Decoder

Generate text response based on visual soft prompts and text instruction

Model or implementation: Vicuna (based on LLaMA)

Modeling

Base Model: MiniGPT4 (Vicuna-7B) and mPLUG-Owl (Vicuna-7B)

Training Method: Supervised Fine-Tuning (Instruction Tuning)
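
The summary does not state the loss, but under the usual formulation of supervised instruction tuning (an assumption here), the parameters $\theta$ are fit by minimizing the token-level negative log-likelihood of the reference response $Y = (y_1, \dots, y_T)$ given the image $I$ and instruction $X$:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid I, X, y_{<t}\right)$$

For a negative instruction, the reference $Y$ is a response that denies or corrects the false premise, so the same objective is what teaches the model to say 'No'.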

Training Data:

LRV-Instruction: 400k GPT-4 generated instructions based on Visual Genome
Includes 16 vision-language tasks
Three levels of negative instructions: Nonexistent Object Manipulation, Existent Object Manipulation (e.g., wrong attributes), Knowledge Manipulation
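
A minimal sketch of assembling a training mix with a chosen ratio of negative to positive instructions. The function name `mix_instructions`, the record fields, and the sample instructions are hypothetical; the actual LRV-Instruction file format is not described in this summary.

```python
import random

def mix_instructions(positives, negatives, neg_per_pos=1.0, seed=0):
    """Return a shuffled list with about neg_per_pos negatives per positive."""
    rng = random.Random(seed)
    n_neg = min(len(negatives), round(len(positives) * neg_per_pos))
    sampled_negatives = rng.sample(negatives, n_neg)
    mixed = list(positives) + sampled_negatives
    rng.shuffle(mixed)
    return mixed

# Hypothetical records; the three negative levels follow the names in this summary.
levels = ["nonexistent_object", "existent_object", "knowledge_manipulation"]
positives = [{"instruction": f"Describe region {i}.", "kind": "positive"} for i in range(6)]
negatives = [
    {"instruction": f"Negative question {i} ({level}).", "kind": level}
    for level in levels
    for i in range(3)
]

mixed = mix_instructions(positives, negatives, neg_per_pos=1.0)
n_negative = sum(item["kind"] != "positive" for item in mixed)
print(len(mixed), n_negative)  # 12 6
```

Setting `neg_per_pos=0.0` reproduces a positive-only mix, which is the kind of comparison an ablation over data composition would need.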

Key Hyperparameters:

model_size: 7B parameters (for both MiniGPT4 and mPLUG-Owl experiments)

Compute: NVIDIA Quadro RTX 8000

Comparison to Prior Work

vs. MiniGPT4/LLaVA: LRV-Instruction includes explicit negative instructions to correct 'yes-bias'
vs. POPE: GAVIE allows open-ended evaluation rather than just binary Yes/No accuracy
vs. InstructBLIP: LRV-Instruction covers broad tasks but synthesizes data via GPT-4 rather than aggregating existing datasets

Limitations

Evaluation (GAVIE) relies on GPT-4, which can have its own hallucinations or biases and is not open-source.
Negative instructions on 'Existent Object Manipulation' (e.g., wrong attributes) remain challenging for current vision encoders.
The approach was only validated on 7B parameter models due to compute constraints.

Reproducibility

Code: https://github.com/FuxiaoLiu/LRV-Instruction

Code and data are publicly available in the repository above. Because the paper relies on GPT-4 for both data generation and evaluation, exact reproduction depends on the OpenAI API version (the paper used GPT4-32k-0314).

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on hallucination and instruction following benchmarks

Benchmarks:

LRV-Instruction Eval Set (Open-ended VQA) [New]
POPE (Object Hallucination, Binary Classification)
MME (Perception and Cognition)
AMBER (Object Hallucination)

Metrics:

GAVIE Score (Accuracy, Relevancy)
Accuracy (for POPE)
MME Score
Statistical methodology: Standard Deviation (STD) reported for GAVIE stability check
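
A minimal sketch of how two of these metrics could be computed. The function names and all input values are made up for illustration. The 'yes' ratio is an extra diagnostic added here to expose the yes-bias described earlier; it is not listed among the metrics above. The stability check is assumed to be a mean and sample standard deviation over repeated evaluation runs.

```python
from statistics import mean, stdev

def pope_metrics(predictions, labels):
    """Accuracy and fraction of 'yes' answers for binary yes/no questions."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == l for p, l in zip(predictions, labels))
    yes_ratio = sum(p == "yes" for p in predictions) / len(predictions)
    return correct / len(labels), yes_ratio

def score_stability(scores_per_run):
    """Mean and sample standard deviation of one score across repeated runs."""
    return mean(scores_per_run), stdev(scores_per_run)

# Made-up data: a model that always answers 'yes' on a set with 2 of 6 true objects.
labels = ["yes", "no", "no", "yes", "no", "no"]
always_yes = ["yes"] * len(labels)
accuracy, yes_ratio = pope_metrics(always_yes, labels)

# Made-up scores from three repeated evaluations of the same response.
avg_score, score_std = score_stability([6.0, 7.0, 8.0])
```

A yes-biased model shows a 'yes' ratio near 1.0 with accuracy no better than the share of positive questions, which is why accuracy on negative samples is reported separately in the ablation.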

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on the proposed GAVIE benchmark shows finetuned models outperform baselines in both Accuracy and Relevancy.
LRV-Instruction Eval (GAVIE)	Accuracy (0-10)	5.68	6.43	+0.75
LRV-Instruction Eval (GAVIE)	Relevancy (0-10)	6.61	7.44	+0.83
Hallucination evaluation on external benchmarks (POPE) confirms significant reduction in hallucination rates.
POPE (Random split)	Accuracy (%)	56.8	85.0	+28.2
POPE (Adversarial split)	Accuracy (%)	52.8	82.8	+30.0
Ablation study on data composition ratios demonstrates the necessity of negative samples.
LRV-Instruction Eval	Accuracy on Negative Samples	48.0	77.1	+29.1
LRV-Instruction Eval	Accuracy on Positive Samples	88.6	90.0	+1.4

Main Takeaways

Current LMMs suffer from severe hallucination due to unbalanced (positive-only) training data.
Finetuning on LRV-Instruction significantly mitigates hallucination across multiple benchmarks (POPE, AMBER, MME) without sacrificing general performance.
A balanced ratio (1:1) of positive to negative instructions yields the most robust model performance.
Existent Object Manipulation (wrong attributes) and Knowledge Manipulation are harder for models to handle than simple Nonexistent Object queries.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Multi-Modal Models (LMMs) architecture
Concept of Hallucination in generative models
Visual Instruction Tuning

Key Terms

LMM: Large Multi-Modal Model—a model capable of processing and generating content across multiple modalities (e.g., image and text)

LRV-Instruction: Large-scale Robust Visual Instruction—the authors' proposed dataset containing 400k visual instructions with balanced positive and negative samples

GAVIE: GPT4-Assisted Visual Instruction Evaluation—the authors' proposed evaluation method using GPT-4 to score accuracy and relevancy without ground truth

Negative Instructions: Instructions asking about objects, attributes, or relationships that are NOT present in the image, forcing the model to deny or correct the premise

POPE: Polling-based Object Probing Evaluation, a benchmark that evaluates object hallucination by asking binary 'Is there a...' questions

Visual Genome: A dataset with detailed visual annotations (objects, attributes, relationships) used as the source for generating instructions

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices