
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang
Columbia University
North American Chapter of the Association for Computational Linguistics (2024)
MM RL Factuality Benchmark

📝 Paper Summary

Video Instruction Following | RLHF / DPO for Multimodal Models | Video Hallucination Reduction
The paper aligns video LMMs by using detailed video captions as text-based evidence for reward modeling, enabling effective Direct Preference Optimization (DPO) without expensive video-based reward models.
Core Problem
Aligning video Large Multimodal Models (LMMs) is difficult because existing reward models struggle to detect hallucinations in video responses, and human or GPT-4V preference data is prohibitively expensive to scale.
Why it matters:
  • Current RLHF/DPO methods work well for text but struggle with multimodal inputs due to the scarcity of alignment data.
  • Hallucinations in video QA are hard to detect without costly frame-by-frame analysis.
  • Collecting human preference data for video is slow and expensive (e.g., LLaVA-RLHF cost $3000 for just 10k instances).
Concrete Example: In a video QA task about a space scene, a standard SFT model hallucinates the quote 'I'm not scared of space' even though neither the audio nor the video contains it. A text-only reward model might miss this error, while a GPT-4V reward model is too expensive to run on thousands of training examples.
Key Novelty
Factually Augmented RLHF via Caption Proxies
  • Uses detailed text captions (generated by GPT-4V) as a proxy for video content, allowing a cheaper text-only LLM to serve as the reward model.
  • Constructs a massive dataset (ShareGPTVideo) of 900k detailed video captions to support this text-based factual grounding.
  • Applies Direct Preference Optimization (DPO) to fine-tune the video LMM, using rewards derived from this text-evidence mechanism.
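The pipeline in the bullets above can be illustrated with a minimal sketch. It assumes that each candidate response has already been scored by a text-only LLM judge that saw the detailed caption as evidence. The helper names (`build_preference_pair`, `dpo_loss`), the `min_gap` threshold, and `beta=0.1` are hypothetical choices for illustration, not the paper's settings. The loss itself is the standard per-example DPO objective.

```python
import math


def build_preference_pair(scored_responses, min_gap=1):
    """Pick (chosen, rejected) from [(response, score), ...].

    Scores are assumed to come from a caption-grounded LLM judge
    (hypothetical setup). Returns None when the best and worst scores
    differ by less than min_gap, so no clear preference exists.
    """
    ranked = sorted(scored_responses, key=lambda item: item[1], reverse=True)
    (best, best_score), (worst, worst_score) = ranked[0], ranked[-1]
    if best_score - worst_score < min_gap:
        return None
    return best, worst


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss from sequence log-probabilities.

    loss = -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m)
    return softplus(-margin)


if __name__ == "__main__":
    # Made-up judge scores for three candidate answers to one video question.
    pair = build_preference_pair([("answer A", 4), ("answer B", 2), ("answer C", 3)])
    print(pair)
    # Made-up log-probabilities under the policy and the frozen reference model.
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss drops below that value as the policy shifts probability toward the chosen response relative to the reference.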
Evaluation Highlights
  • +8.1% accuracy improvement on Video QA tasks using LLaVA-Hound-DPO compared to its SFT counterpart.
  • The proposed text-based reward mechanism achieves >70% agreement with the much more expensive GPT-4V reward model.
  • Generated caption-based reward labeling costs <$20 for 120k pairs, compared to ~$3000 for 10k human labels.
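The cost comparison in the last bullet can be made concrete with a quick per-item calculation. This is a rough sketch using only the figures quoted above. It treats "<$20" as an upper bound of exactly $20 and "~$3000" as exactly $3000, and it compares a human-labeled instance with a machine-labeled pair as if they were equivalent units, which is a simplifying assumption.

```python
# Figures quoted in the summary; treated as exact for illustration.
human_cost_usd, human_items = 3000.0, 10_000       # human preference labels
caption_cost_usd, caption_pairs = 20.0, 120_000    # caption-based reward labels

human_per_item = human_cost_usd / human_items        # dollars per human label
caption_per_pair = caption_cost_usd / caption_pairs  # dollars per labeled pair
ratio = human_per_item / caption_per_pair            # per-item cost ratio

print(f"human label:   ${human_per_item:.4f} per item")
print(f"caption label: ${caption_per_pair:.6f} per pair")
print(f"cost ratio:    about {ratio:.0f}x")
```

Because $20 is an upper bound on the caption-based cost, the ratio of roughly 1800x is a lower bound on the per-item saving under these assumptions.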
Breakthrough Assessment
8/10
Significant for demonstrating that text captions can effectively proxy video for alignment, drastically reducing the cost of multimodal RLHF/DPO while achieving SOTA results.