Tsinghua University,
National University of Singapore
arXiv
(2023)
MM, RL, Factuality
📝 Paper Summary
Multimodal Large Language Models (MLLMs), AI Safety & Alignment
RLHF-V reduces MLLM hallucinations by collecting segment-level human corrections and optimizing the model with Dense Direct Preference Optimization to prioritize factual segments over linguistic variations.
Core Problem
Existing MLLMs frequently hallucinate content not present in images, and standard RLHF using whole-response ranking suffers from annotation ambiguity and inefficient credit assignment.
Why it matters:
Hallucinations make MLLMs untrustworthy for high-stakes applications like autonomous driving or assisting visually impaired individuals
Ranking entire responses is ambiguous when a long response contains both correct details and hallucinations, making it hard for annotators to choose
Sparse ranking signals (A > B) are inefficient for learning fine-grained factual behaviors, often leading to reward hacking based on non-robust biases (e.g., response length)
Concrete Example: When describing a clock, a model might correctly identify the object but hallucinate the time. An annotator asked to rank responses struggles to compare this against a response that misses the clock but describes the background correctly. In RLHF-V, the human instead corrects only the segment that states the time.
Key Novelty
Dense Direct Preference Optimization (DDPO) on Segment-Level Corrections
Collects feedback as specific text segment corrections (e.g., changing 'three dogs' to 'two dogs') rather than ranking full responses, isolating the exact error
Modifies the DPO objective to calculate response likelihood as a weighted sum, giving significantly higher weight to the corrected segments than unchanged parts
Treats the corrected response as the positive sample and the original hallucinated response as the negative sample in a supervised optimization framework
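A minimal sketch of how corrected segments could be located once an annotator has rewritten a response. It diffs the original and corrected text at the word level with Python's standard `difflib`; the function name `corrected_segments` and the whitespace tokenization are illustrative assumptions, not the authors' tooling.

```python
from difflib import SequenceMatcher


def corrected_segments(original: str, corrected: str):
    """Word-level diff between a model response and its human correction.

    Returns the two word lists, a boolean mask per list marking words that
    belong to a changed segment, and the list of (old, new) segment pairs.
    Whitespace tokenization is an assumption made for illustration.
    """
    a, b = original.split(), corrected.split()
    a_mask = [False] * len(a)
    b_mask = [False] * len(b)
    edits = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged text carries no correction signal
        for i in range(i1, i2):
            a_mask[i] = True  # hallucinated words in the original
        for j in range(j1, j2):
            b_mask[j] = True  # replacement words in the correction
        edits.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return a, b, a_mask, b_mask, edits


a, b, a_mask, b_mask, edits = corrected_segments(
    "There are three dogs on the grass",
    "There are two dogs on the grass",
)
print(edits)
```

The masks are what a segment-weighted objective would consume: marked positions receive the higher weight, and everything else keeps weight 1.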
Architecture
Comparison of traditional RLHF ranking vs. RLHF-V's fine-grained correction. It illustrates how ranking (Option A vs B) is ambiguous when responses have mixed quality, whereas RLHF-V's correction explicitly fixes the 'red' (hallucinated) segments to 'green' (correct) segments.
Evaluation Highlights
Reduces object hallucination rate of the base MLLM by 34.8% using only 1.4k annotated samples
Outperforms concurrent LLaVA-RLHF (which used 10k annotated samples) despite using 7x less data
Demonstrates better robustness than GPT-4V in preventing hallucinations caused by over-generalization in qualitative checks
Breakthrough Assessment
8/10
Significant efficiency gain (beating a 10k-sample baseline with 1.4k samples) and a methodologically sound shift from coarse ranking to fine-grained correction for MLLMs.
⚙️ Technical Details
Problem Definition
Setting: Aligning Multimodal Large Language Models to human factuality preferences using feedback
Inputs: Image x and text prompt
Outputs: Text response y (factually grounded in x)
Model or implementation: Muffin (based on context of data collection)
Human Annotator
Identifies and rewrites hallucinated segments in the model output
Model or implementation: Human
Novel Architectural Elements
Dense Direct Preference Optimization (DDPO) loss function integration
Segment-weighted token scoring mechanism during training (weighting factor gamma applied to corrected segments)
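The segment-weighted scoring can be written out as follows. This is a sketch consistent with the description in this summary (a weighted sum with weight gamma > 1 on corrected segments, plugged into the DPO objective). The normalizer N, the symbols y_u and y_c for unchanged and corrected tokens, and the temperature beta taken from standard DPO are notational assumptions rather than quotations from the paper.

```latex
% Segment-weighted sequence score (y_u: unchanged tokens, y_c: corrected tokens)
\log \pi(y \mid x) = \frac{1}{N}\Big[
    \sum_{y_i \in y_u} \log p(y_i \mid x, y_{<i})
    + \gamma \sum_{y_i \in y_c} \log p(y_i \mid x, y_{<i})
\Big], \qquad N = |y_u| + \gamma\,|y_c|, \quad \gamma > 1

% DPO-style loss with corrected response y_w (preferred) and original response y_l
\mathcal{L}_{\mathrm{DDPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[
    \log \sigma\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Big)
\Big]
```

With gamma = 1 the score reduces to a plain length-normalized log-likelihood, so gamma controls how strongly the corrected segments dominate the preference signal.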
Modeling
Base Model: Muffin (instruction-tuned MLLM)
Training Method: Dense Direct Preference Optimization (DDPO)
Objective Functions:
Purpose: Optimize policy to prefer corrected segments over hallucinated ones.
Formally: L_DDPO uses a weighted log-likelihood where corrected segments are scaled by gamma > 1.
Training Data:
1.4k prompts from instruction tuning datasets and GPT-4 generated prompts
Responses generated by Muffin model
Human annotations: 64.4 words average length, 2.65 corrected segments per response
Key Hyperparameters:
gamma: > 1 (weighting parameter for corrected segments)
Compute: Not reported in the paper
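The training objective described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released implementation: per-token log-probabilities are taken as given, the score is normalized by the total weight, and `beta` is the usual DPO temperature. The names `weighted_logp` and `ddpo_loss` are hypothetical.

```python
import math


def weighted_logp(token_logps, corrected_mask, gamma):
    """Segment-weighted, weight-normalized log-likelihood of one response.

    Tokens inside a corrected segment get weight gamma, all others weight 1.
    Dividing by the total weight is an assumption made here to keep long
    responses from scoring differently just because of their length.
    """
    if len(token_logps) != len(corrected_mask) or not token_logps:
        raise ValueError("need equal-length, non-empty inputs")
    total, norm = 0.0, 0.0
    for lp, is_corrected in zip(token_logps, corrected_mask):
        w = gamma if is_corrected else 1.0
        total += w * lp
        norm += w
    return total / norm


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ddpo_loss(policy_w, ref_w, policy_l, ref_l, beta):
    """DPO-style loss on weighted scores.

    policy_* / ref_* are weighted_logp values under the trained policy and
    the frozen reference model; w is the corrected response, l the original.
    """
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return softplus(-margin)  # equals -log(sigmoid(margin))


# Toy example with made-up log-probabilities: the second token is corrected.
print(weighted_logp([-1.0, -3.0], [False, True], gamma=3.0))  # -2.5
```

A real training loop would compute the same quantities on tensors and backpropagate through the policy scores only; the reference scores stay fixed.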
Comparison to Prior Work
vs. LLaVA-RLHF: RLHF-V uses fine-grained segment corrections (DDPO) instead of whole-response ranking (PPO/DPO), achieving better results with 1/7th the data
vs. GPT-4V: RLHF-V focuses specifically on reducing hallucinations from over-generalization, claiming better robustness in those specific failure modes
Limitations
Dependent on high-quality human annotations for corrections
Focuses primarily on hallucination reduction, potentially at the cost of other metrics (though not explicitly stated as a failure mode)
Requires an existing instruction-tuned base model to generate initial responses for correction
Code, data, and model weights are open-sourced at https://github.com/RLHF-V/RLHF-V. The paper details the data collection process (segment-level corrections) and the exact modification to the DPO objective (weighted segments).
📊 Experiments & Results
Evaluation Setup
Evaluated on trustworthiness/hallucination across multiple MLLM benchmarks.
Benchmarks:
HallusionBench (Hallucination Evaluation)
POPE (Object Hallucination Evaluation)
MME (Multimodal Evaluation)
MMBench (Multimodal Evaluation)
LLaVA-Bench (General MLLM Capability)
Metrics:
Hallucination Rate
Trustworthiness
Statistical methodology: Not explicitly reported in the paper
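This summary does not define how the hallucination rate is computed. As a hedged illustration only, one common object-level definition counts the objects a response mentions that are absent from the image's ground-truth object set. The function name and the assumption that object mentions have already been extracted are hypothetical.

```python
def hallucination_rates(mentioned, present):
    """Object-level hallucination rates over a set of responses.

    mentioned: list of sets, objects named in each response (assumed to be
               extracted beforehand; the extraction step is not shown).
    present:   list of sets, ground-truth objects for each image.
    Returns (response_rate, mention_rate):
      response_rate = share of responses with at least one absent object
      mention_rate  = share of all object mentions that are absent objects
    """
    responses_with_error = 0
    bad_mentions = 0
    total_mentions = 0
    for said, truth in zip(mentioned, present):
        extra = said - truth  # objects named but not in the image
        responses_with_error += bool(extra)
        bad_mentions += len(extra)
        total_mentions += len(said)
    n = len(mentioned)
    response_rate = responses_with_error / n if n else 0.0
    mention_rate = bad_mentions / total_mentions if total_mentions else 0.0
    return response_rate, mention_rate


# Toy data: the first response invents a frisbee, the second is clean.
rates = hallucination_rates(
    [{"dog", "frisbee"}, {"clock"}],
    [{"dog"}, {"clock", "wall"}],
)
print(rates)
```

Missing a real object (the "wall" above) does not count as a hallucination under this definition; it only penalizes content that is not in the image.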
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
Internal Eval / General | Hallucination Rate Reduction (%) | 0.0 | 34.8 | +34.8
Comparison vs LLaVA-RLHF | Training Samples Required | 10,000 | 1,400 | -8,600
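The arithmetic behind the headline figures can be checked directly. The sample counts come from the results above; reading 34.8% as a relative reduction of the base model's rate is an assumption, and the base rate of 50.0 below is a made-up value used only to show the calculation.

```python
baseline_samples = 10_000  # LLaVA-RLHF annotated samples (from the results above)
rlhfv_samples = 1_400      # RLHF-V annotated samples (from the results above)

# How many times more data the baseline used, and the share of data saved.
ratio = baseline_samples / rlhfv_samples
saved_fraction = 1 - rlhfv_samples / baseline_samples


def apply_relative_reduction(base_rate, reduction_pct):
    """Rate after a relative reduction, e.g. 34.8 means 34.8% lower."""
    return base_rate * (1 - reduction_pct / 100)


print(f"{ratio:.2f}x, {saved_fraction:.0%} fewer samples")
# Hypothetical base rate of 50.0, chosen only for illustration.
print(f"{apply_relative_reduction(50.0, 34.8):.1f}")
```

The ratio comes out slightly above 7, which is where the "7x less data" claim comes from.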
Main Takeaways
Fine-grained segment-level corrections are significantly more data-efficient than holistic ranking for aligning MLLMs (1.4k vs 10k samples).
Dense Direct Preference Optimization (DDPO) leverages local error information by up-weighting the corrected segments, which mitigates the credit assignment problem of standard whole-response RLHF.
Simple training recipes like high-quality VQA post-training and disabling random image cropping also contribute to reducing vision-language mismatch and hallucinations.
📚 Prerequisite Knowledge
Prerequisites
Understanding of Multimodal LLMs (e.g., LLaVA)
Familiarity with Reinforcement Learning from Human Feedback (RLHF)
Knowledge of Direct Preference Optimization (DPO)
Key Terms
MLLM: Multimodal Large Language Model—an AI that processes both images and text to generate text outputs
RLHF: Reinforcement Learning from Human Feedback—a method to tune models using human preferences, typically rankings
DPO: Direct Preference Optimization—an algorithm that optimizes the policy directly on preference data without training a separate reward model
DDPO: Dense Direct Preference Optimization—the authors' proposed variant of DPO that weights specific segments of text more heavily
Hallucination: When a model generates text descriptions (objects, attributes, actions) that are not actually present in the associated image
Segment-level correction: Human feedback provided by rewriting only the incorrect phrase in a sentence rather than rating the whole sentence
VQA: Visual Question Answering—a task where the model answers questions about an image