Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

📝 Paper Summary

Vision-Language Model Alignment · Hallucination Mitigation · Preference Optimization

POVID aligns vision-language models by generating synthetic dispreferred data—hallucinating text via GPT-4V and distorting images to trigger inherent errors—then fine-tuning via Direct Preference Optimization.

Core Problem

Vision Large Language Models (VLLMs) hallucinate because text generation is not perfectly aligned with visual inputs, often prioritizing language priors over image content.

Why it matters:

Current preference tuning methods (like RLHF) rely on costly human data or model-generated pairs where both answers might be wrong, failing to anchor the correct answer to the image
Hallucinations in VLLMs pose significant risks in high-stakes deployment scenarios like medical imaging or autonomous driving
Existing methods struggle to create effective 'negative' samples that specifically target the disconnect between visual perception and text generation

Concrete Example: Given an image of a table with a knife and oranges, a standard VLLM might hallucinate a 'fork' because it statistically co-occurs with 'knife' in text data, ignoring the actual image. POVID deliberately distorts the image to force this error, then trains the model to reject it.

Key Novelty

Preference Optimization with AI-Generated Dispreferences (POVID)

Generates negative training data (dispreferences) automatically without humans, using two strategies: asking GPT-4V to insert plausible hallucinations into correct text, and adding noise to images to trigger the model's own internal errors
Uses Direct Preference Optimization (DPO) to contrast these synthetic hallucinations against ground-truth descriptions, forcing the model to trust visual cues over language priors

Architecture

The POVID framework's data generation and training pipeline.

Evaluation Highlights

Reduces the sentence-level object hallucination rate (CHAIRS, lower is better) to 31.8 on the CHAIR benchmark, down from 66.8 for the base LLaVA-1.5 model
Achieves 68.7 on LLaVA-Bench, surpassing RLHF-V (65.4) and LLaVA-1.5 (63.4) without using human preference data
Outperforms larger models like InstructBLIP and Qwen-VL-Chat on 5 out of 8 benchmarks despite using a smaller 7B backbone

Breakthrough Assessment

8/10

Significantly outperforms human-feedback methods using only AI-generated data. The dual strategy of text-injection and image-noise triggering is a clever, scalable solution to the alignment problem.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning Vision Large Language Models (VLLMs) to align text generation with visual inputs using preference data

Inputs: Image x and text prompt

Outputs: Text response y

Pipeline Flow

Data Generation Stage 1: GPT-4V Hallucination Injection
Data Generation Stage 2: Image Distortion Triggering
Fine-tuning Stage: DPO Training

System Modules

GPT-4V Hallucinator (Data Generation)

Rewrites ground-truth captions to include plausible hallucinations (object co-occurrence, wrong attributes, logical errors)

Model or implementation: GPT-4V

Image Distorter (Data Generation)

Adds diffusion noise to input images during training to trigger inherent model hallucinations

Model or implementation: Gaussian noise function
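The noise step described above can be sketched as the standard forward diffusion process, x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps, where a_t is the cumulative signal fraction at step t. This is a minimal illustration, not the authors' code: the linear beta schedule, the 1000-step horizon, the beta range, and the function names are all assumptions. Only the use of Gaussian diffusion noise and a noise step of 500 come from the summary.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_k) over the first t steps.

    Assumes a linear beta schedule (an assumption, not taken from the paper).
    """
    prod = 1.0
    for k in range(t):
        beta = beta_start + (beta_end - beta_start) * k / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_diffusion_noise(pixels, t, rng=None):
    """Return x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps with eps ~ N(0, 1).

    `pixels` is a flat list of floats standing in for an image tensor.
    """
    if rng is None:
        rng = random.Random(0)
    a = alpha_bar(t)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]


# Hypothetical usage: distort a tiny "image" at the noise step listed in the summary.
noisy = add_diffusion_noise([0.2, 0.5, 0.9], t=500)
```

Under this assumed schedule, a larger `t` leaves less of the original signal, which is why the choice of noise step trades off distortion against recognizability.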

VLLM Policy

The target model being optimized to prefer ground truth over hallucinations

Model or implementation: LLaVA-1.5 (7B)

Novel Architectural Elements

Integration of real-time image noise injection within the DPO loss loop to trigger and penalize inherent model hallucinations during training

Modeling

Base Model: LLaVA-1.5 (7B)

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to prefer ground truth over AI-generated hallucinations.

Formally: DPO loss that minimizes the negative log-sigmoid of the difference between the preferred and dispreferred log-likelihood ratios, each taken relative to the reference model.

Purpose: Mitigate inherent hallucinations triggered by noise.

Formally: Modified DPO loss incorporating terms for both text-injected hallucinations and noise-triggered hallucinations (Eqn 5).
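As a rough illustration of the first objective, the standard DPO loss for a single preference pair can be computed from four sequence log-probabilities. This sketch covers only the generic DPO form. The paper's modified loss (Eqn 5) is not reproduced here, and the argument names and the default `beta=0.1` are assumptions rather than values from the paper.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    Each argument is the summed log-probability of a full response:
    *_w for the preferred (winning) answer, *_l for the dispreferred one.
    """
    # Difference of log-likelihood ratios relative to the reference model.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


# With a policy identical to the reference, the margin is 0 and the loss is log(2).
baseline_loss = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Raising the preferred answer's likelihood relative to the reference lowers the loss.
improved_loss = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss falls as the policy assigns relatively more probability to the ground-truth answer than to the hallucinated one, which is the behavior the training stage aims for.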

Training Data:

17K examples randomly sampled from LLaVA-Instruct-150K dataset
Original answers treated as preferred
Dispreferred answers generated via GPT-4V and noise injection
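One way to picture the resulting training records is as image, prompt, preferred, and dispreferred tuples. The sketch below covers only the text-injected pairs. The class name, field names, and sample values are hypothetical and do not reflect the released data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceExample:
    """One preference pair; all field names are illustrative assumptions."""
    image_path: str
    prompt: str
    preferred: str      # original answer from the instruction dataset
    dispreferred: str   # hallucinated rewrite of the same answer


def build_example(record, hallucinated_answer):
    """Pair an original record with its hallucinated rewrite.

    `record` is assumed to be a dict with 'image', 'prompt', and 'answer' keys.
    """
    if hallucinated_answer == record["answer"]:
        raise ValueError("dispreferred answer must differ from the preferred one")
    return PreferenceExample(
        image_path=record["image"],
        prompt=record["prompt"],
        preferred=record["answer"],
        dispreferred=hallucinated_answer,
    )


# Hypothetical record mirroring the knife-and-oranges example.
example = build_example(
    {"image": "table.jpg", "prompt": "Describe the image.",
     "answer": "A knife and oranges on a table."},
    "A knife, a fork, and oranges on a table.",
)
```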

Key Hyperparameters:

learning_rate: 1e-7
batch_size: 1
lora_r: 128
lora_alpha: 256
mm_projector_lr: 2e-5
noise_step: 500
model_max_length: 1024
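For reference, the listed values can be gathered into a plain configuration dictionary. The dictionary layout and the helper below are illustrative only. The one derived quantity uses the usual LoRA convention that the adapter update is scaled by lora_alpha / lora_r.

```python
# Hyperparameters as listed in this summary; the dict layout itself is an assumption.
POVID_HPARAMS = {
    "learning_rate": 1e-7,
    "batch_size": 1,
    "lora_r": 128,
    "lora_alpha": 256,
    "mm_projector_lr": 2e-5,
    "noise_step": 500,
    "model_max_length": 1024,
}


def lora_scaling(hparams):
    """Scaling factor applied to the low-rank update under the standard LoRA convention."""
    return hparams["lora_alpha"] / hparams["lora_r"]


scaling = lora_scaling(POVID_HPARAMS)
```

With these values the scaling factor is 2.0, so the low-rank update is applied at twice its raw magnitude.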

Compute: 1 A100 80GB GPU for approximately 6 hours

Comparison to Prior Work

vs. LLaVA-RLHF: POVID uses entirely AI-generated feedback (GPT-4V + Noise) rather than human annotations
vs. RLHF-V: POVID targets inherent hallucination patterns via image distortion, whereas RLHF-V relies on correcting text segments
vs. Silkie: POVID explicitly constructs negative pairs from ground truth modifications rather than relying on standard model generations
vs. HA-DPO [not cited in paper]: Similar focus on hallucination-aware DPO, but POVID uniquely uses image noise to trigger internal priors

Limitations

Relies on GPT-4V for generating textual dispreferences, which may have its own biases or costs
Image noise strategy requires careful tuning of noise steps to balance distortion and recognizability
Evaluation is primarily on 7B parameter models; scaling to larger models is not explicitly tested

Reproducibility

Code: https://github.com/YiyangZhou/POVID

Code and data are publicly available. The paper details the GPT-4V prompts and the noise injection formula, and the base model, LLaVA-1.5, is open source.

📊 Experiments & Results

Evaluation Setup

Evaluation on hallucination benchmarks and comprehensive VLLM capability benchmarks

Benchmarks:

CHAIR (Image Captioning Hallucination)
POPE (Object Existence, Yes/No)
MMHal (Hallucination & Informativeness)
LLaVA-Bench (Multimodal Chat/Reasoning)
MMBench (Perception and Reasoning)

Metrics:

CHAIRS (Sentence-level hallucination rate)
CHAIRi (Instance-level hallucination rate)
POPE F1/Accuracy
MMHal Score

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
POVID significantly reduces hallucinations compared to baselines on dedicated hallucination benchmarks.
CHAIR	CHAIRS	66.8	31.8	-35.0
POPE	Accuracy/F1	85.90	86.90	+1.00
POVID improves general capabilities across comprehensive benchmarks, showing alignment doesn't degrade general performance.
LLaVA-Bench	Score	63.4	68.7	+5.3
MMBench	Score	63.0	64.9	+1.9
Comparison with other open-source VLLMs shows POVID (7B) competitive with larger models.
MM-Vet	Score	26.2	31.8	+5.6
Ablation study confirms both text hallucination and image distortion strategies contribute to performance.
CHAIR	CHAIRS	39.6	31.8	-7.8
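The Δ column reports absolute differences. To compare rows on a common footing, the relative change can be computed from the same baseline and result values. The helper name is hypothetical, and the numbers are copied from the table above.

```python
def relative_change(baseline, new):
    """Percentage change from baseline to new (negative means a reduction)."""
    return (new - baseline) / baseline * 100.0


# (baseline, this paper) pairs copied from the results table.
results = {
    "CHAIRS": (66.8, 31.8),
    "LLaVA-Bench": (63.4, 68.7),
    "MMBench": (63.0, 64.9),
    "MM-Vet": (26.2, 31.8),
}

for name, (baseline, new) in results.items():
    print(f"{name}: {relative_change(baseline, new):+.1f}%")
```

For CHAIRS this gives a relative reduction of about 52.4%, since 35.0 / 66.8 ≈ 0.524.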

Experiment Figures

Attention maps comparing LLaVA-1.5 and POVID on image captioning and VQA tasks.

Main Takeaways

POVID cuts the sentence-level hallucination rate (CHAIRS) by about 52% relative to the LLaVA-1.5 baseline (66.8 to 31.8).
The method improves general multimodal performance (LLaVA-Bench, MM-Vet) instead of sacrificing it for lower hallucination, a trade-off common in alignment.
Both components (GPT-4V generated text negatives and noise-triggered image negatives) are necessary; ablating either reduces performance.
Attention analysis shows POVID shifts model focus from textual context back to visual tokens, correcting the root cause of many hallucinations.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision Large Language Models (VLLMs)
Familiarity with Reinforcement Learning from Human Feedback (RLHF)
Knowledge of Direct Preference Optimization (DPO)

Key Terms

DPO: Direct Preference Optimization—a method to fine-tune models on preference pairs (winner/loser) by optimizing a classification loss, avoiding a separate reward model

VLLM: Vision Large Language Model—a multimodal model combining a visual encoder with a large language model to process images and text

CHAIR: Captioning Hallucination Assessment with Image Relevance—a metric evaluating the accuracy of object descriptions by comparing captioned objects to ground truth

POPE: Polling on Object Existence—a benchmark using binary Yes/No questions to test if a model hallucinates non-existent objects

diffusion noise: Gaussian noise added to an image, used here to disrupt visual features and trigger the model's reliance on language priors (hallucinations)

RLHF: Reinforcement Learning from Human Feedback—training method using human preferences to guide model behavior

hallucination: Generations where the model produces content not grounded in the input image, often based on language priors or spurious correlations

object co-occurrence: The statistical tendency of certain objects (e.g., knife and fork) to appear together, which can mislead models to hallucinate one when only the other is present
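To make the CHAIR term above concrete, the sketch below computes both variants from pre-extracted object lists: CHAIRS is the fraction of captions containing at least one hallucinated object, and CHAIRi is the fraction of object mentions that are hallucinated. It is a simplified illustration. The real metric also extracts objects from caption text and maps synonyms onto a fixed object vocabulary, both omitted here, and the function and argument names are hypothetical.

```python
def chair_scores(mentioned_per_caption, truth_per_image):
    """Return (chair_s, chair_i) as fractions in [0, 1].

    mentioned_per_caption: list of lists of object names found in each caption.
    truth_per_image: list of sets of objects actually present in each image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for mentioned, truth in zip(mentioned_per_caption, truth_per_image):
        hallucinated = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1

    num_captions = len(mentioned_per_caption)
    chair_s = captions_with_hallucination / num_captions if num_captions else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i


# Toy data: the first caption hallucinates a fork next to the knife.
chair_s, chair_i = chair_scores(
    [["knife", "orange", "fork"], ["table"]],
    [{"knife", "orange", "table"}, {"table"}],
)
```

In this toy case one of two captions contains a hallucination (CHAIRS = 0.5) and one of four object mentions is hallucinated (CHAIRi = 0.25). Benchmarks typically report these values as percentages.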