Aligning Large Multimodal Models with Factually Augmented RLHF

📝 Paper Summary

Large Multimodal Models (LMMs) Vision-Language Alignment Reinforcement Learning from Human Feedback (RLHF)

LLaVA-RLHF adapts reinforcement learning to multimodal models by augmenting the reward model with image captions and ground-truth answers to prevent it from being fooled by hallucinations.

Core Problem

Large Multimodal Models (LMMs) frequently hallucinate because they are trained on limited or lower-quality multimodal data compared to text-only models, causing misalignment between visual and textual modalities.

Why it matters:

Hallucinated outputs in multimodal systems (e.g., describing objects not present in an image) severely undermine user trust and practical utility in real-world applications.
Standard RLHF approaches suffer from 'reward hacking,' where the model optimizes for a high reward score without actually improving factual alignment, often due to weak reward models.
Collecting high-quality human preference data for multimodal tasks is expensive and scarce compared to text-only domains.

Concrete Example: When asking an LMM to describe an image, it might confidently describe 'a red car' when the image actually contains a blue truck. A standard reward model might accept the confident text if it flows well, failing to penalize the visual mismatch. LLaVA-RLHF's reward model sees the ground truth caption 'blue truck' and penalizes the 'red car' response.

Key Novelty

Factually Augmented RLHF (Fact-RLHF)

Augments the Reward Model with 'cheat sheets' (image captions, ground-truth options) during training, allowing it to detect hallucinations that a standard model might miss.
Enhances the Supervised Fine-Tuning (SFT) stage by converting high-quality human annotations (VQA, Flickr30k) into conversation formats, rather than relying solely on synthetic GPT-4 data.
Introduces MMHal-Bench, a new evaluation benchmark specifically designed to penalize hallucinations across 8 task types and 12 object categories.

Architecture

Illustration of the Factually Augmented RLHF (Fact-RLHF) framework compared to standard RLHF.

Evaluation Highlights

Achieves 94% of text-only GPT-4's performance level on LLaVA-Bench, surpassing the previous best methods which reached only 87%.
Achieves a 60% relative improvement on the new MMHal-Bench compared to baselines by specifically reducing hallucinations.
Establishes new performance benchmarks for LLaVA with 52.4% accuracy on MMBench and 82.7% F1 score on POPE.

Breakthrough Assessment

8/10

First successful application of RLHF to Large Multimodal Models for hallucination reduction. The method of augmenting the reward model with factual data addresses the key 'reward hacking' bottleneck in multimodal RL.

⚙️ Technical Details

Problem Definition

Setting: Multimodal alignment via Reinforcement Learning from Human Feedback (RLHF)

Inputs: Image I and text prompt x

Outputs: Text response y aligned with visual content

Pipeline Flow

Vision Encoder (processes image)
Projection Layer (maps image features to text space)
Large Language Model (generates response)

System Modules

Vision Encoder (Input Processing)

Encodes input images into visual feature embeddings

Model or implementation: CLIP ViT-L/14 (Pre-trained)

Projection Layer (Input Processing)

Projects visual features into the LLM's word embedding space

Model or implementation: Linear Layer

Large Language Model

Generates text response based on multimodal context

Model or implementation: Vicuna-V1.5 (7B or 13B)

Novel Architectural Elements

Factually Augmented Reward Model (Training only): The reward model architecture is modified to accept additional 'factual' inputs (captions, ground truth choices) alongside the image and prompt during the training phase, unlike the standard inference pipeline.

Modeling

Base Model: Vicuna-V1.5 (7B and 13B variants) with CLIP ViT-L/14 vision encoder

Training Method: Factually Augmented RLHF (Fact-RLHF) with PPO

Objective Functions:

Purpose: Train reward model to distinguish preferred responses.

Formally: Cross-entropy loss on pairwise comparisons: -log(sigmoid(r(x, y_preferred) - r(x, y_rejected))).
Purpose: Optimize policy to maximize reward while staying close to SFT model.

Formally: PPO objective with KL penalty: E[r(x, y) - beta * log(pi_RL(y|x) / pi_INIT(y|x))].
Purpose: Penalize wrong answers in multiple-choice questions.

Formally: Symbolic reward mechanism penalizing divergence from ground-truth options.
Purpose: Penalize verbosity to reduce hallucination.

Formally: Length penalty based on number of tokens.

Adaptation: LoRA (Low-Rank Adaptation) used for all fine-tuning processes (Policy, Reward, Value models)

Trainable Parameters: LoRA parameters (models fit on single GPU)

Training Data:

SFT Data: LLaVA synthetic data (98k) + VQA-v2 (83k) + A-OKVQA (16k) + Flickr30k (23k)
RLHF Data: 10k human preferences collected on hold-out LLaVA data, plus 12k A-OKVQA and 10k VQA-v2 samples

Key Hyperparameters:

sampling_temperature: 0.7
image_resolution_13b: 336x336
image_resolution_7b: 256x256

Compute: Models fit on one GPU using LoRA

Comparison to Prior Work

vs. LLaVA: LLaVA-RLHF uses RLHF with a factually augmented reward model and additional high-quality SFT data, whereas LLaVA relies on SFT with synthetic data.
vs. InstructBLIP/IDEFICS: LLaVA-RLHF specifically targets hallucination reduction via reinforcement learning, whereas others focus primarily on general instruction tuning.
vs. Standard RLHF [general concept]: Fact-RLHF augments the reward model with external ground truth (captions/answers) to prevent reward hacking, which standard multimodal RLHF lacks.

Limitations

The 'Honest' reward model concept is discussed but a piecewise Honesty-prioritized model is left for future work.
RLHF training can induce verbosity, which correlates with hallucinations, necessitating an explicit length penalty.
Requires ground-truth data (captions, answers) for the Fact-RLHF stage, which may not be available for all custom datasets.

Reproducibility

Code: https://llava-rlhf.github.io

Code, model, and data are publicly available at https://llava-rlhf.github.io. The reward model initialization uses LLaVA-SFT+ checkpoints. Human preference data collection templates are provided in the paper.

📊 Experiments & Results

Evaluation Setup

Multimodal generation and question answering, focusing on hallucination detection and general helpfulness.

Benchmarks:

MMHal-Bench (Hallucination evaluation) [New]
LLaVA-Bench (General multimodal chat evaluation)
MMBench (Multimodal capability evaluation)
POPE (Object hallucination evaluation)

Metrics:

Relative performance to GPT-4 (%)
Accuracy (%)
F1 score
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on LLaVA-Bench comparing the proposed method against the previous best approaches relative to GPT-4.
LLaVA-Bench	Relative performance to GPT-4	87	94	+7

Main Takeaways

LLaVA-RLHF achieves a 60% improvement on MMHal-Bench compared to baselines, demonstrating the effectiveness of Fact-RLHF in reducing hallucinations.
Augmenting synthetic SFT data with high-quality human annotations (VQA, Flickr30k) significantly boosts performance even before RLHF.
The approach achieves new state-of-the-art results for LLaVA on standard benchmarks like MMBench (52.4%) and POPE (82.7% F1).
Fact-RLHF successfully mitigates reward hacking by providing the reward model with ground-truth context that the policy model might ignore or hallucinate over.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Large Multimodal Models (LMM)
Supervised Fine-Tuning (SFT)

Key Terms

RLHF: Reinforcement Learning from Human Feedback—a method to align AI models with human values by training a reward model on human preferences and optimizing the policy against it

Reward Hacking: A failure mode where the AI optimizes for the reward score (the metric) rather than the intended high-quality behavior, often exploiting flaws in the reward model

SFT: Supervised Fine-Tuning—the initial training phase where the model learns to mimic high-quality demonstration data before RL is applied

Fact-RLHF: Factually Augmented RLHF—the paper's proposed method where the reward model is given extra factual context (captions, answers) to better judge the policy's truthfulness

PPO: Proximal Policy Optimization—an RL algorithm used to update the model's policy while ensuring stability by limiting how much the policy changes in one step

Hallucination: In LMMs, generating text that is not grounded in or contradicts the visual information provided in the image context

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices

LMM: Large Multimodal Model—a deep learning model capable of processing and generating output for multiple modalities, typically image and text

KL penalty: Kullback-Leibler penalty—a regularization term added to the RL loss to prevent the model from drifting too far from its initial learned behavior