Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

📝 Paper Summary

Large Vision-Language Models (LVLMs), Hallucination Mitigation

HA-DPO mitigates hallucinations in multimodal models by reframing the problem as preference optimization, utilizing a style-consistent dataset construction pipeline to ensure models learn factuality rather than stylistic patterns.

Core Problem

Large Vision-Language Models (LVLMs) frequently suffer from hallucinations—generating plausible but incorrect details about images—which limits their reliability in critical tasks.

Why it matters:

Hallucinated details (e.g., non-existent objects, wrong attributes) mislead users and can have severe consequences in fields like medical diagnostics.
Existing Supervised Fine-Tuning (SFT) methods require expensive high-quality annotations, while post-hoc correction methods increase inference latency and depend on external tools.
Standard RLHF/DPO approaches struggle with data quality and distribution shifts, where models learn to distinguish responses based on writing style rather than factual content.

Concrete Example: When asked to describe an image, an LVLM might confidently describe a 'red car' that isn't present. Standard training might try to correct this with a human-written caption, but if the correction is written in a different style from the model's own output, the model learns to mimic the style instead of correcting the hallucination.

Key Novelty

Hallucination-Aware Direct Preference Optimization (HA-DPO) with Style-Consistent Data

Reframes hallucination elimination as a DPO preference task where the model learns to favor non-hallucinatory outputs over hallucinatory ones without a separate reward model.
Introduces a 'Style-Consistent' data construction pipeline: GPT-4 rewrites both the correct (positive) and incorrect (negative) responses to share the same linguistic style, preventing the model from exploiting style shortcuts.
Proposes Sentence-level Hallucination Ratio (SHR), a fine-grained metric for evaluating hallucinations beyond fixed object categories.

Evaluation Highlights

MiniGPT-4 improved POPE accuracy from 51.13% to 86.13% (+35 absolute points) after HA-DPO training.
MiniGPT-4 MME score increased from 932.00 to 1326.46 (+42.32% relative improvement).
HA-DPO stabilizes training: unlike standard DPO where fluency degrades over time due to distribution shifts, the style-consistent approach maintains sentence fluency throughout optimization.

Breakthrough Assessment

8/10

Significant performance jumps on standard benchmarks (POPE/MME) and addresses a critical DPO failure mode (style exploitation) in multimodal settings. The automated data pipeline reduces reliance on expensive human feedback.

⚙️ Technical Details

Problem Definition

Setting: Multimodal text generation where the model must generate a description y given an image x_I and text prompt x_T.

Inputs: Image prompt x_I and Text prompt x_T

Outputs: Textual response y that is factually consistent with x_I

Pipeline Flow

Image + Text Input
Large Vision-Language Model (Fine-tuned via HA-DPO)
Non-hallucinatory Text Output

System Modules

Large Vision-Language Model

Generates textual descriptions or answers based on multimodal inputs

Model or implementation: MiniGPT-4 (also applied to LLaVA, InstructBLIP)

Novel Architectural Elements

Integration of an auxiliary causal language modeling loss (SFT loss) directly into the DPO preference learning objective to prevent performance regression.

Modeling

Base Model: MiniGPT-4 (primary results), LLaVA, InstructBLIP

Training Method: Hallucination-Aware Direct Preference Optimization (HA-DPO)

Objective Functions:

Purpose: Optimize policy to favor non-hallucinatory responses.

Formally: L_DPO = -E [log sigmoid( beta * log(pi_theta(y_pos | x_T, x_I)/pi_ref(y_pos | x_T, x_I)) - beta * log(pi_theta(y_neg | x_T, x_I)/pi_ref(y_neg | x_T, x_I)) )], where pi_ref is the frozen reference model and beta scales the preference margin.

Purpose: Maintain general language capabilities and training stability.

Formally: L_SFT = -E [log pi_theta(y | x_P)]

Purpose: Combined optimization objective.

Formally: L = L_DPO + lambda * L_SFT
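
The three objectives above can be illustrated with a minimal pure-Python sketch. It assumes the sequence log-probabilities have already been computed by the policy and the reference model; the function names and the numeric values in the usage lines are illustrative assumptions, not taken from the paper (which, per this summary, does not report beta or lambda).

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign of z so that exp() never overflows.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_pos: float, ref_logp_pos: float,
             logp_neg: float, ref_logp_neg: float, beta: float) -> float:
    """Per-example L_DPO from sequence log-probabilities.

    logp_*     : log pi_theta(y | x_T, x_I) for the positive / negative response
    ref_logp_* : the same quantities under the frozen reference model pi_ref
    """
    pos_ratio = logp_pos - ref_logp_pos  # log(pi_theta / pi_ref) for y_pos
    neg_ratio = logp_neg - ref_logp_neg  # log(pi_theta / pi_ref) for y_neg
    return -log_sigmoid(beta * pos_ratio - beta * neg_ratio)


def sft_loss(token_logps: Sequence[float]) -> float:
    """Per-example L_SFT: negative log-likelihood of the target response,
    given per-token log-probabilities log pi_theta(y_t | x_P, y_<t)."""
    return -sum(token_logps)


def combined_loss(dpo: float, sft: float, lam: float) -> float:
    """L = L_DPO + lambda * L_SFT."""
    return dpo + lam * sft


# Illustrative values only (beta and lam are assumptions, not reported values).
example_dpo = dpo_loss(-4.0, -5.0, -8.0, -7.0, beta=0.1)
example_sft = sft_loss([-0.5, -1.5])
print(combined_loss(example_dpo, example_sft, lam=0.1))
```

When the policy equals the reference model, both log-ratios are zero and the DPO term reduces to log 2; the loss falls below that value only as the policy shifts probability toward the non-hallucinated response relative to the reference.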

Training Data:

Source: Visual Genome (VG) dataset
Description Generation: LVLM generates initial descriptions (prone to hallucination)
Correction: GPT-4 uses VG annotations to detect hallucinations and produce a corrected (positive) response, which is paired with the original hallucinated (negative) response
Augmentation: GPT-4 rewrites BOTH positive and negative samples to ensure style consistency and converts descriptions into QA pairs
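
The data construction steps above (generation, correction, style-consistent rewriting) can be sketched as a small pipeline. This is a hedged illustration, not the authors' code: `describe`, `correct`, and `rewrite` are hypothetical callables standing in for the LVLM and the GPT-4 prompts, the fixed question string is a placeholder for the QA conversion step, and skipping images with no detected hallucination is an assumption made here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PreferencePair:
    image_id: str
    question: str
    positive: str  # corrected, non-hallucinated response
    negative: str  # original, hallucinated response


def build_pair(
    image_id: str,
    annotations: Sequence[str],
    describe: Callable[[str], str],                # hypothetical: LVLM description
    correct: Callable[[str, Sequence[str]], str],  # hypothetical: GPT-4 correction
    rewrite: Callable[[str], str],                 # hypothetical: GPT-4 style rewrite
    question: str = "Describe this image.",        # placeholder prompt
) -> Optional[PreferencePair]:
    # Step 1: the LVLM writes a description that may contain hallucinations.
    raw = describe(image_id)
    # Step 2: the description is corrected against ground-truth annotations.
    fixed = correct(raw, annotations)
    # Assumption (not from the paper): skip images where nothing was corrected,
    # since identical responses carry no preference signal.
    if fixed == raw:
        return None
    # Step 3: rewrite BOTH responses so they share one linguistic style.
    return PreferencePair(image_id, question, rewrite(fixed), rewrite(raw))


# Toy stand-ins so the sketch runs without any model or API access.
pair = build_pair(
    "img_001",
    ["dog", "grass"],
    describe=lambda _id: "A dog on grass next to a red car.",
    correct=lambda text, ann: "A dog on grass.",
    rewrite=lambda text: text.lower(),
)
print(pair)
```

Passing both responses through the same `rewrite` step is the point of the design: whatever stylistic fingerprint the rewriter leaves is shared by the positive and the negative sample, so it cannot serve as a shortcut for telling them apart.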

Key Hyperparameters:

beta: Not explicitly reported in the paper
lambda: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. SFT: HA-DPO is more data-efficient and allows custom preference definition (hallucination elimination) without massive annotation overhead.
vs. Post-Hoc: HA-DPO modifies the model itself, requiring no extra inference-time tools or latency.
vs. Other DPO/RLHF: HA-DPO uses a 'Style-Consistent' dataset where positive/negative pairs are linguistically similar, preventing the model from learning stylistic shortcuts—a major failure mode in previous attempts.

Limitations

The method relies on GPT-4 for data construction and hallucination detection, creating a dependency on a closed-source model.
The effectiveness of the style-consistency rewrite depends on the quality of the rewrite prompt and GPT-4's adherence to it.
This summary reports detailed results only for MiniGPT-4; numeric breakdowns for LLaVA and InstructBLIP are less granular.

Reproducibility

Code: https://opendatalab.github.io/HA-DPO

Code, models, and datasets are publicly available at the project page listed above. The paper details the prompts used for GPT-4-based data generation and correction. Specific hyperparameters for the DPO training (learning rate, batch size, lambda) are not explicitly reported.

📊 Experiments & Results

Evaluation Setup

Assessment of multimodal hallucination and general performance capabilities.

Benchmarks:

POPE (Object-Existence Hallucination Evaluation)
MME (Comprehensive Multimodal Evaluation)
SHR (Sentence-level Hallucination Ratio) [New]

Metrics:

Accuracy (POPE)
Score (MME)
Sentence-level Hallucination Ratio (SHR)
Statistical methodology: Not explicitly reported in the paper
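
As a rough illustration of how two of these metrics can be computed, the sketch below treats POPE accuracy as the fraction of yes/no answers matching the ground truth, and SHR as the fraction of sentences flagged as hallucinated. The per-sentence judge `is_hallucinated` is a hypothetical callable, and the exact SHR formula is an assumption based on the metric's name and description in this summary.

```python
from typing import Callable, Sequence


def pope_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of yes/no answers that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, labels)
    )
    return correct / len(labels)


def sentence_hallucination_ratio(
    sentences: Sequence[str],
    is_hallucinated: Callable[[str], bool],  # hypothetical per-sentence judge
) -> float:
    """Assumed definition: hallucinated sentences / total sentences."""
    if not sentences:
        return 0.0
    flagged = sum(1 for s in sentences if is_hallucinated(s))
    return flagged / len(sentences)


# Toy usage: a keyword check stands in for a real judge.
acc = pope_accuracy(["yes", "no", "Yes", "no"], ["yes", "yes", "yes", "no"])
shr = sentence_hallucination_ratio(
    ["A dog sits on grass.", "A red car is parked nearby.",
     "The sky is clear.", "The dog looks at the camera."],
    is_hallucinated=lambda s: "red car" in s,
)
print(acc, shr)
```

Higher is better for POPE accuracy, while lower is better for SHR.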

Key Results

Application of HA-DPO to MiniGPT-4 yields substantial improvements in both hallucination metrics and general multimodal capabilities.

Benchmark	Metric	Baseline	This Paper	Δ
POPE	Accuracy (%)	51.13	86.13	+35.00
MME	Score	932.00	1326.46	+394.46
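
The deltas in the table, and the 42.32% relative MME improvement quoted under Evaluation Highlights, follow directly from the baseline and post-training numbers. A short check, with helper names chosen here for illustration:

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Difference between the post-training and baseline values."""
    return improved - baseline


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Gain expressed as a percentage of the baseline value."""
    return 100.0 * (improved - baseline) / baseline


# (baseline, after HA-DPO) for MiniGPT-4, as reported in the table.
results = {
    "POPE accuracy": (51.13, 86.13),
    "MME score": (932.00, 1326.46),
}

for name, (base, new) in results.items():
    print(
        f"{name}: {absolute_gain(base, new):+.2f} absolute, "
        f"{relative_gain_percent(base, new):+.2f}% relative"
    )
```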

Experiment Figures

Comparison of data distribution and sentence fluency during training with and without style consistency.

Main Takeaways

HA-DPO substantially raises POPE accuracy (fewer object hallucinations) while also improving general multimodal performance (MME), suggesting that reducing hallucinations strengthens overall visual grounding.
Style consistency in the DPO dataset is critical; without it, models may suffer from 'preference collapse' or loss of fluency (repeating words) during training.
The proposed data pipeline (Generation -> Correction -> Style Augmentation) effectively creates high-quality preference pairs without human annotation.
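
One way to make the style-consistency takeaway concrete is a quick surface-level diagnostic on a preference pair. The sketch below is not described in the paper; it is a hypothetical check that compares response lengths and vocabulary overlap, two of the cheapest style cues a model could exploit.

```python
from typing import Dict


def style_gap(positive: str, negative: str) -> Dict[str, float]:
    """Crude style-similarity statistics for one preference pair.

    length_ratio  : shorter length / longer length, in whitespace tokens
    vocab_jaccard : shared vocabulary / combined vocabulary
    Values near 1.0 mean the two responses look alike on the surface.
    """
    pos_tokens = positive.lower().split()
    neg_tokens = negative.lower().split()
    if not pos_tokens or not neg_tokens:
        raise ValueError("both responses must be non-empty")
    length_ratio = (
        min(len(pos_tokens), len(neg_tokens))
        / max(len(pos_tokens), len(neg_tokens))
    )
    pos_vocab, neg_vocab = set(pos_tokens), set(neg_tokens)
    vocab_jaccard = len(pos_vocab & neg_vocab) / len(pos_vocab | neg_vocab)
    return {"length_ratio": length_ratio, "vocab_jaccard": vocab_jaccard}


stats = style_gap(
    "a dog sits on the grass",
    "a dog sits on the grass near a red car",
)
print(stats)
```

A pair that scores low on both statistics differs in far more than the hallucinated detail, which is the situation the style-consistent rewrite is meant to avoid. Tone and phrasing are not captured by these two numbers.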

📚 Prerequisite Knowledge

Prerequisites

Direct Preference Optimization (DPO)
Reinforcement Learning from Human Feedback (RLHF)
Large Vision-Language Models (LVLMs)
Supervised Fine-Tuning (SFT)

Key Terms

DPO: Direct Preference Optimization—a method to align language models to preferences by optimizing a policy to satisfy a preference ranking without explicitly training a reward model

POPE: Polling on Object Existence—a benchmark for evaluating object hallucination in LVLMs by asking yes/no questions about object presence

MME: Multimodal Evaluation—a comprehensive benchmark for evaluating LVLM performance across various tasks

Hallucination: The phenomenon where a model generates content (objects, attributes, relationships) that does not exist in the source image

Style Consistency: Ensuring that positive and negative training samples in a preference dataset share the same linguistic patterns (length, tone, vocabulary) so the model optimizes for content, not style

Visual Genome: A large-scale dataset with detailed image annotations (objects, attributes, relationships) used here as ground truth for hallucination detection

SHR: Sentence-level Hallucination Ratio—a metric proposed in this paper to quantify hallucinations at the sentence level rather than just object existence