RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

📝 Paper Summary

Multimodal Large Language Models (MLLMs), AI Safety & Alignment

RLHF-V reduces MLLM hallucinations by collecting segment-level human corrections and training the model with Dense Direct Preference Optimization (DDPO), which weights the corrected, factual segments more heavily than incidental differences in wording.

Core Problem

Existing MLLMs frequently hallucinate content not present in images, and standard RLHF using whole-response ranking suffers from annotation ambiguity and inefficient credit assignment.

Why it matters:

Hallucinations make MLLMs untrustworthy for high-stakes applications like autonomous driving or assisting visually impaired individuals
Ranking entire responses is ambiguous when a long response contains both correct details and hallucinations, making it hard for annotators to choose
Sparse ranking signals (A > B) are inefficient for learning fine-grained factual behaviors, often leading to reward hacking based on non-robust biases (e.g., response length)

Concrete Example: When describing a clock, a model might correctly identify the object but hallucinate the time. A rank-based annotator struggles to rank this against a response that misses the clock but describes the background correctly. RLHF-V has the human explicitly correct just the time segment.

Key Novelty

Dense Direct Preference Optimization (DDPO) on Segment-Level Corrections

Collects feedback as specific text segment corrections (e.g., changing 'three dogs' to 'two dogs') rather than ranking full responses, isolating the exact error
Modifies the DPO objective to compute each response's log-likelihood as a weighted sum over its tokens, giving significantly higher weight to the corrected segments than to the unchanged parts
Treats the corrected response as the positive sample and the original hallucinated response as the negative sample in a supervised optimization framework
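
To make the idea of isolating the exact error concrete, the sketch below recovers which tokens of a corrected response differ from the original. It is an illustration only: the function name corrected_token_mask, the whitespace tokenization, and the use of Python's difflib are assumptions made here, not the authors' annotation tooling.

```python
from difflib import SequenceMatcher

def corrected_token_mask(original: str, corrected: str):
    """Return (tokens, mask) for the corrected response.

    mask[i] is True when corrected token i was inserted or replaced
    relative to the original response, and False when it was unchanged.
    Whitespace tokenization is a simplification for illustration.
    """
    orig_tokens = original.split()
    corr_tokens = corrected.split()
    mask = [False] * len(corr_tokens)
    matcher = SequenceMatcher(a=orig_tokens, b=corr_tokens, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                mask[j] = True
    return corr_tokens, mask

# Hypothetical example mirroring the 'three dogs' -> 'two dogs' correction.
tokens, mask = corrected_token_mask(
    "There are three dogs on the grass.",
    "There are two dogs on the grass.",
)
changed = [tok for tok, flagged in zip(tokens, mask) if flagged]
```

Here changed holds only the rewritten word, which is the part a segment-weighted objective would emphasize.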

Architecture

Comparison of traditional RLHF ranking vs. RLHF-V's fine-grained correction. It illustrates how ranking (Option A vs B) is ambiguous when responses have mixed quality, whereas RLHF-V's correction explicitly fixes the 'red' (hallucinated) segments to 'green' (correct) segments.

Evaluation Highlights

Reduces object hallucination rate of the base MLLM by 34.8% using only 1.4k annotated samples
Outperforms concurrent LLaVA-RLHF (which used 10k annotated samples) with roughly one-seventh as much annotated data
Demonstrates better robustness than GPT-4V in preventing hallucinations caused by over-generalization in qualitative checks

Breakthrough Assessment

8/10

Significant efficiency gain (beating a 10k-sample baseline with 1.4k samples) and a methodologically sound shift from coarse ranking to fine-grained correction for MLLMs.

⚙️ Technical Details

Problem Definition

Setting: Aligning Multimodal Large Language Models to human factuality preferences using feedback

Inputs: Image x and text prompt

Outputs: Text response y (factually grounded in x)

Pipeline Flow

Input Image & Prompt -> MLLM Policy -> Response
Annotator (Human) -> Segment-Level Correction -> (Preferred Response y_w, Dispreferred Response y_l)
DDPO Training -> Update MLLM Weights
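
The middle step of the flow above, turning an annotator's segment edits into a (y_w, y_l) pair, can be sketched as follows. The Correction and PreferencePair classes, the character-offset representation, and the example strings are all hypothetical; the source does not specify the data format.

```python
from dataclasses import dataclass

@dataclass
class Correction:
    start: int        # character offset where the hallucinated segment begins
    end: int          # character offset one past the end of that segment
    replacement: str  # annotator's corrected text

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # y_w: response after human corrections
    rejected: str  # y_l: original model response

def build_pair(prompt: str, response: str, corrections: list) -> PreferencePair:
    chosen = response
    # Apply edits right to left so earlier offsets remain valid.
    for c in sorted(corrections, key=lambda c: c.start, reverse=True):
        chosen = chosen[:c.start] + c.replacement + chosen[c.end:]
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=response)

# Hypothetical clock example: only the hallucinated time is rewritten.
response = "The clock shows 3:15 next to a red vase."
start = response.index("3:15")
pair = build_pair(
    "Describe the image.",
    response,
    [Correction(start=start, end=start + len("3:15"), replacement="10:10")],
)
```

The original response is kept untouched as the dispreferred sample, so the two sides of the pair differ only in the corrected segments.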

System Modules

Base MLLM

Generates initial responses (to be corrected)

Model or implementation: Muffin (inferred from the description of data collection)

Human Annotator

Identifies and rewrites hallucinated segments in the model output

Model or implementation: Human

Novel Architectural Elements

Dense Direct Preference Optimization (DDPO) loss function integration
Segment-weighted token scoring mechanism during training (weighting factor gamma applied to corrected segments)

Modeling

Base Model: Muffin (instruction-tuned MLLM)

Training Method: Dense Direct Preference Optimization (DDPO)

Objective Functions:

Purpose: Optimize policy to prefer corrected segments over hallucinated ones.

Formally: L_DDPO uses a weighted log-likelihood where corrected segments are scaled by gamma > 1.
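
One way to write this out, as a sketch rather than a quotation of the paper: take the standard DPO loss and replace each sequence log-probability with a weighted sum over unchanged tokens y_u and corrected tokens y_c. The normalizer N, the reference policy \pi_{ref}, and the temperature \beta follow common DPO notation and are assumptions here; the source states only that corrected segments are scaled by gamma > 1.

```latex
\log \pi(y \mid x) = \frac{1}{N} \Big[ \sum_{y_i \in y_u} \log p(y_i \mid x, y_{<i})
  + \gamma \sum_{y_i \in y_c} \log p(y_i \mid x, y_{<i}) \Big],
  \qquad N = |y_u| + \gamma \, |y_c|

\mathcal{L}_{\mathrm{DDPO}} = -\, \mathbb{E}_{(x, y_w, y_l)} \Big[ \log \sigma \Big(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big]
```

With gamma = 1 the first expression reduces to a plain length-normalized log-likelihood, so gamma controls how strongly the corrected segments dominate the preference signal.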

Training Data:

1.4k prompts from instruction tuning datasets and GPT-4 generated prompts
Responses generated by Muffin model
Human annotations: 64.4 words average length, 2.65 corrected segments per response

Key Hyperparameters:

gamma: > 1 (weighting parameter for corrected segments)
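
To show what the weighting parameter does numerically, here is a minimal sketch that scores one response with and without extra weight on its corrected token. The function names, the normalization by total weight, the value gamma = 5.0, and the per-token log-probabilities are all hypothetical; the source gives only gamma > 1.

```python
import math

def weighted_logprob(token_logps, corrected_mask, gamma):
    """Weighted, normalized log-probability of one response.

    Tokens flagged in corrected_mask get weight gamma, all others weight 1.
    Dividing by the total weight is an assumption of this sketch.
    """
    weights = [gamma if is_corrected else 1.0 for is_corrected in corrected_mask]
    total = sum(w * lp for w, lp in zip(weights, token_logps))
    return total / sum(weights)

def preference_loss(pol_w, ref_w, pol_l, ref_l, beta):
    """DPO-style loss -log(sigmoid(margin)) computed on the weighted scores."""
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    z = -margin
    # Numerically stable softplus(z) = log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical per-token log-probabilities; the third token is the corrected one.
token_logps = [-0.1, -0.1, -2.0, -0.1]
corrected_mask = [False, False, True, False]

plain = weighted_logprob(token_logps, corrected_mask, gamma=1.0)
dense = weighted_logprob(token_logps, corrected_mask, gamma=5.0)
```

Because the low-probability corrected token carries more weight under gamma = 5.0, dense is lower than plain, so the gradient of the loss concentrates on the segment the annotator actually changed.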

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaVA-RLHF: RLHF-V uses fine-grained segment corrections (DDPO) instead of whole-response ranking (PPO/DPO), achieving better results with about one-seventh of the annotated data (1.4k vs. 10k samples)
vs. GPT-4V: RLHF-V focuses specifically on reducing hallucinations from over-generalization, claiming better robustness in those specific failure modes

Limitations

Dependent on high-quality human annotations for corrections
Focuses primarily on hallucination reduction, which may come at the cost of other metrics, although the paper does not explicitly report such a trade-off
Requires an existing instruction-tuned base model to generate initial responses for correction

Reproducibility

Code: https://github.com/RLHF-V/RLHF-V

Code, data, and model weights are open-sourced at https://github.com/RLHF-V/RLHF-V. The paper details the data collection process (segment-level corrections) and the exact modification to the DPO objective (weighted segments).

📊 Experiments & Results

Evaluation Setup

Evaluated on trustworthiness/hallucination across multiple MLLM benchmarks.

Benchmarks:

HallusionBench (Hallucination Evaluation)
POPE (Object Hallucination Evaluation)
MME (Multimodal Evaluation)
MMBench (Multimodal Evaluation)
LLaVA-Bench (General MLLM Capability)

Metrics:

Hallucination Rate
Trustworthiness
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Internal Eval / General	Hallucination rate reduction vs. base MLLM (%)	0.0	34.8	+34.8
Comparison vs LLaVA-RLHF	Annotated training samples (count)	10,000	1,400	-8,600
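
The data-efficiency figures quoted in this summary follow directly from the two sample counts in the table. The short check below reproduces the arithmetic; the variable names are illustrative.

```python
llava_rlhf_samples = 10_000  # annotated samples used by LLaVA-RLHF
rlhf_v_samples = 1_400       # annotated samples used by RLHF-V

delta = rlhf_v_samples - llava_rlhf_samples     # difference shown in the table
ratio = llava_rlhf_samples / rlhf_v_samples     # how many times more data the baseline used
fraction = rlhf_v_samples / llava_rlhf_samples  # share of the baseline's data

print(f"delta={delta}, ratio={ratio:.2f}x, fraction={fraction:.0%}")
```

The ratio is slightly above 7, which is why the summary describes RLHF-V as using about one-seventh of the data.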

Main Takeaways

Fine-grained segment-level corrections are significantly more data-efficient than holistic ranking for aligning MLLMs (1.4k vs 10k samples).
Dense Direct Preference Optimization (DDPO) effectively leverages local error information, mitigating the credit assignment problem that arises when standard RLHF learns from whole-response rankings.
Simple training recipes like high-quality VQA post-training and disabling random image cropping also contribute to reducing vision-language mismatch and hallucinations.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal LLMs (e.g., LLaVA)
Familiarity with Reinforcement Learning from Human Feedback (RLHF)
Knowledge of Direct Preference Optimization (DPO)

Key Terms

MLLM: Multimodal Large Language Model—an AI that processes both images and text to generate text outputs

RLHF: Reinforcement Learning from Human Feedback—a method to tune models using human preferences, typically rankings

DPO: Direct Preference Optimization—an algorithm that optimizes the policy directly on preference data without training a separate reward model

DDPO: Dense Direct Preference Optimization—the authors' proposed variant of DPO that weights specific segments of text more heavily

Hallucination: When a model generates text descriptions (objects, attributes, actions) that are not actually present in the associated image

Segment-level correction: Human feedback provided by rewriting only the incorrect segments of a response rather than ranking whole responses against each other

VQA: Visual Question Answering—a task where the model answers questions about an image