Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

📝 Paper Summary

Video Instruction Following · RLHF / DPO for Multimodal Models · Video Hallucination Reduction

The paper aligns video LMMs by using detailed video captions as text-based evidence for reward modeling, enabling effective Direct Preference Optimization (DPO) without expensive video-based reward models.

Core Problem

Aligning video Large Multimodal Models (LMMs) is difficult because existing reward models struggle to detect hallucinations in video responses, and human or GPT-4V preference data is prohibitively expensive to scale.

Why it matters:

Current RLHF/DPO methods work well for text but struggle with multimodal inputs due to the scarcity of alignment data.
Hallucinations in video QA are hard to detect without costly frame-by-frame analysis.
Collecting human preference data for video is slow and expensive (e.g., LLaVA-RLHF cost $3000 for just 10k instances).

Concrete Example: In a video QA task about a space scene, a standard SFT model hallucinates 'I'm not scared of space' when the audio/video doesn't contain it. A text-only reward model might miss this, while a GPT-4V reward model is too expensive to run on thousands of training examples.

Key Novelty

Factually Augmented RLHF via Caption Proxies

Uses detailed text captions (generated by GPT-4V) as a proxy for video content, allowing a cheaper text-only LLM to serve as the reward model.
Constructs a massive dataset (ShareGPTVideo) of 900k detailed video captions to support this text-based factual grounding.
Applies Direct Preference Optimization (DPO) using rewards derived from this text-evidence mechanism to fine-tune the video LMM.

Evaluation Highlights

+8.1 percentage points of average accuracy on Video QA tasks for LLaVA-Hound-DPO over its SFT counterpart (62.65 to 70.75).
The proposed text-based reward mechanism achieves >70% agreement with the much more expensive GPT-4V reward model.
Generated caption-based reward labeling costs <$20 for 120k pairs, compared to ~$3000 for 10k human labels.
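The cost figures above are easier to compare per labeled item. The following is a small arithmetic sketch using only the numbers quoted in this summary; the caption-based figure is an upper bound (the summary says "<$20"), so the resulting ratio is a lower bound, and the variable names are illustrative.

```python
# Figures quoted in the summary; the caption-based cost is an upper bound ("<$20").
caption_cost_usd = 20.0     # upper bound for labeling 120k preference pairs
caption_pairs = 120_000
human_cost_usd = 3000.0     # approximate cost quoted for LLaVA-RLHF
human_labels = 10_000

caption_per_item = caption_cost_usd / caption_pairs  # upper bound per pair
human_per_item = human_cost_usd / human_labels       # approximate cost per label
ratio = human_per_item / caption_per_item            # lower bound on the savings

print(f"caption-based: at most ${caption_per_item:.5f} per pair")
print(f"human labels:  about ${human_per_item:.2f} per label")
print(f"per-item cost ratio: at least {ratio:.0f}x")
```

Per item, the caption-based labels come out roughly three orders of magnitude cheaper, which is the basis for the scalability claim.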

Breakthrough Assessment

8/10

Significant for demonstrating that text captions can effectively proxy video for alignment, drastically reducing the cost of multimodal RLHF/DPO while achieving SOTA results.

⚙️ Technical Details

Problem Definition

Setting: Video Instruction Following and Question Answering

Inputs: Video V and text question x

Outputs: Text response y

Pipeline Flow

Caption Generation (GPT-4V creates detailed captions for 900k videos)
Instruction Generation (ChatGPT creates QA pairs from captions)
SFT Training (Fine-tune Video-LLaVA on generated instructions)
Preference Data Generation (Sample SFT responses, score via ChatGPT using captions as evidence)
DPO Training (Optimize model using preference pairs)
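The preference-data step of the pipeline above can be sketched as follows. This is a minimal illustration, not the authors' code: `build_pair`, `toy_score`, and the `min_gap` threshold are hypothetical, and `toy_score` is a word-overlap stand-in for the ChatGPT call that scores a response against the caption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    question: str
    chosen: str
    rejected: str


def build_pair(
    question: str,
    caption: str,
    candidates: List[str],
    score_fn: Callable[[str, str, str], float],
    min_gap: float = 1.0,  # hypothetical threshold, not taken from the paper
) -> Optional[PreferencePair]:
    """Score each sampled response against the caption and keep best vs. worst."""
    scored = [(score_fn(question, caption, c), c) for c in candidates]
    best = max(scored, key=lambda s: s[0])
    worst = min(scored, key=lambda s: s[0])
    if best[0] - worst[0] < min_gap:
        return None  # responses are too close in score to form a useful pair
    return PreferencePair(question, chosen=best[1], rejected=worst[1])


def toy_score(question: str, caption: str, response: str) -> float:
    """Stand-in scorer on a 1-5 scale: fraction of response words found in the caption."""
    caption_words = set(caption.lower().split())
    words = response.lower().split()
    if not words:
        return 1.0
    overlap = sum(w in caption_words for w in words) / len(words)
    return 1.0 + 4.0 * overlap


caption = "an astronaut floats inside a space station"
pair = build_pair(
    "What is happening in the video?",
    caption,
    ["an astronaut floats inside a station", "a dog runs on a beach"],
    toy_score,
)
print(pair)
```

In a real pipeline the scorer would be an LLM call with the caption in the prompt; the best-versus-worst selection rule here is only one of several reasonable ways to turn scores into pairs.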

System Modules

Caption Generator

Create dense textual representations of videos to serve as ground truth

Model or implementation: GPT-4V

Reward/Scoring Model

Score generated responses based on factual alignment with the caption

Model or implementation: ChatGPT (gpt-3.5-turbo)
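To make the scoring module concrete, here is a hedged sketch of how a caption can be packed into a text-only scoring prompt and how a 1-5 score might be parsed from the reply. The prompt wording, `build_scoring_prompt`, and `parse_score` are hypothetical and are not the paper's actual template; no API call is made.

```python
import re
from typing import Optional


def build_scoring_prompt(caption: str, question: str, answer: str) -> str:
    """Assemble a text-only prompt that uses the caption as evidence (hypothetical wording)."""
    return (
        "You are given a detailed description of a video as evidence.\n"
        f"Video description: {caption}\n"
        f"Question: {question}\n"
        f"Model answer: {answer}\n"
        "Rate how factually consistent the answer is with the description "
        "on a scale from 1 to 5. Reply with 'Score: <number>'."
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit between 1 and 5 from the reply, if any."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


prompt = build_scoring_prompt(
    caption="An astronaut floats inside a space station.",
    question="Where is the person?",
    answer="The person is inside a space station.",
)
print(prompt)
print(parse_score("Score: 4"))
```

Because the evidence is plain text, any text-only LLM can act as the scorer, which is what keeps the per-pair cost low.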

Policy Model (LLaVA-Hound)

Generate video responses; optimized via DPO

Model or implementation: Video-LLaVA (LanguageBind encoder + Vicuna LLM)

Modeling

Base Model: Video-LLaVA (LanguageBind encoder + MLP projector + Vicuna LLM)

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to prefer higher-ranked responses.

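Formally: DPO loss L_DPO = -E[log sigma(beta * log(pi_theta(y_w|x,V) / pi_ref(y_w|x,V)) - beta * log(pi_theta(y_l|x,V) / pi_ref(y_l|x,V)))], where y_w is the preferred response, y_l is the rejected response, pi_ref is the frozen reference (SFT) policy, and sigma is the logistic sigmoid.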
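The objective can be evaluated for a single preference pair from four sequence log-probabilities. The sketch below is a plain-Python illustration, assuming the log-probabilities of the preferred (w) and rejected (l) responses under the policy and the reference model have already been computed; the function and argument names are illustrative.

```python
import math


def dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (log-ratio of winner - log-ratio of loser))."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Raising the winner's log-probability relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

In training, this quantity is averaged over a batch of pairs and differentiated with respect to the policy parameters only; the reference model stays fixed.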

Trainable Parameters: Projector and LLM (Visual encoder frozen)

Training Data:

Pre-training: 650k image captions + 900k video captions (ShareGPTVideo)
SFT: 600k image instructions + 240k video instructions
DPO: 17k preference pairs selected from 120k candidates

Key Hyperparameters:

learning_rate: 5e-7
batch_size: 128
epochs: 3
beta: 0.1
temperature: 1.0 (for sampling DPO candidates)

Compute: 8 A100 GPUs

Comparison to Prior Work

vs. Video-LLaVA: Adds DPO stage with caption-based rewards
vs. VLM-RLAIF: Uses DPO instead of PPO; uses detailed captions as evidence rather than direct video evaluation
vs. LLaVA-RLHF: Uses AI-generated captions and rewards instead of human feedback (methodologically distinct, though the paper does not use it as a direct baseline)

Limitations

Relies on the quality of GPT-4V generated captions; errors in captions propagate to rewards.
The benchmark evaluation often uses single-word answers, limiting assessment of long-form helpfulness.
Text-based proxies may miss fine-grained visual details not captured in captions.

Reproducibility

Code and ShareGPTVideo dataset (900k captions) are publicly available. LLaVA-Hound-DPO model checkpoints are released. Evaluation uses ChatGPT (gpt-3.5-turbo-0613) and GPT-4V.

📊 Experiments & Results

Evaluation Setup

Video Question Answering on standard benchmarks

Benchmarks:

MSVD-QA (Video QA)
MSRVTT-QA (Video QA)
TGIF-QA (Video QA)

Metrics:

Accuracy (assessed by ChatGPT)
Score (1-5 scale)
Statistical methodology: Not explicitly reported in the paper

Key Results

DPO training significantly improves performance over the SFT baseline and other SOTA models across combined benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
Average (MSVD, MSRVTT, TGIF)	Accuracy	62.65	70.75	+8.10
Average (MSVD, MSRVTT, TGIF)	Accuracy	59.40	70.75	+11.35
Average (MSVD, MSRVTT, TGIF)	Accuracy	66.50	70.75	+4.25
Average (MSVD, MSRVTT, TGIF)	Accuracy	67.80	70.75	+2.95
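The Δ column can be checked directly from the table. The short sketch below recomputes each difference and also shows the relative gain over the first baseline, to make clear that the deltas are absolute percentage points; the variable names are illustrative, and the table does not name the baselines, so none are assigned here.

```python
ours = 70.75
baselines = [62.65, 59.40, 66.50, 67.80]  # baseline accuracies in table order

# Absolute differences in percentage points, rounded to two decimals.
deltas = [round(ours - b, 2) for b in baselines]
print(deltas)

# Relative improvement over the first baseline, as a percentage of that baseline.
relative_gain_pct = 100.0 * (ours - baselines[0]) / baselines[0]
print(f"{relative_gain_pct:.1f}% relative gain over the first baseline")
```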

Main Takeaways

Detailed video captions can effectively substitute for video content in reward modeling, enabling cheap and scalable DPO.
The proposed LLaVA-Hound-DPO sets a new SOTA for Video QA, outperforming both its SFT base and other recent models like LLaMA-VID.
Text-based reward calculation (ChatGPT + captions) shows a moderate positive correlation (Pearson 0.47) with the more expensive vision-based reward calculation (GPT-4V + frames).
Pre-training on large-scale video captions (ShareGPTVideo) improves generalization, particularly for out-of-domain tasks.
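Agreement rate and Pearson correlation measure different things, which is why the summary can quote both for the same pair of reward models. The sketch below computes each on made-up scores; the data, the threshold of 3 for a "good" response, and the function names are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def agreement_rate(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of items on which two binary judgments coincide."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up 1-5 scores from a text-based and a vision-based scorer.
text_scores: List[float] = [5, 4, 2, 3, 1, 4]
vision_scores: List[float] = [4, 2, 1, 3, 2, 5]

threshold = 3  # hypothetical cutoff for calling a response acceptable
text_ok = [s >= threshold for s in text_scores]
vision_ok = [s >= threshold for s in vision_scores]

print(f"agreement: {agreement_rate(text_ok, vision_ok):.2f}")
print(f"pearson:   {pearson(text_scores, vision_scores):.2f}")
```

Binary agreement can be high even when raw scores correlate only moderately, because small score differences on the same side of the threshold do not count as disagreements.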

📚 Prerequisite Knowledge

Prerequisites

Direct Preference Optimization (DPO)
Large Multimodal Models (LMMs)
Reinforcement Learning from Human Feedback (RLHF)

Key Terms

DPO: Direct Preference Optimization, an algorithm that optimizes language models to satisfy preferences directly without training a separate reward model

SFT: Supervised Fine-Tuning, the initial training phase using labeled examples before preference optimization

LLaVA: Large Language and Vision Assistant—a popular architecture for multimodal models connecting a vision encoder to an LLM

Hallucination: In this context, generating text responses that claim facts not present in or contradicted by the video content

Chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer

Video-LLaVA: The specific backbone architecture used in this paper, utilizing LanguageBind encoder and Vicuna LLM