
Multi-modal preference alignment remedies regression of visual instruction tuning on language model

Shengzhi Li, Rongyu Lin, Shichao Pei
University of Massachusetts Boston, TIFIN Inc, King Abdullah University of Science and Technology
Annual Meeting of the Association for Computational Linguistics (2024)
MM RL Factuality Benchmark

📝 Paper Summary

Multi-modal Large Language Models (MLLMs), Model Alignment
Applying Direct Preference Optimization on a small, AI-annotated multi-modal dataset restores the text instruction-following capabilities of MLLMs that are typically degraded during visual instruction tuning.
Core Problem
Visual Instruction Tuning (VIT) significantly degrades the pure language capabilities of Multi-modal LLMs because visual datasets lack the complexity and diversity of text-only instruction data.
Why it matters:
  • Models like LLaVA perform worse on text-only tasks than their base LLMs (e.g., Vicuna), creating a 'tax' for adding vision capabilities.
  • Production MLLMs need to handle interleaved image-text turns without losing the reasoning or coding abilities of the underlying language model.
  • Current alignment methods like RLHF are computationally expensive and rely on scarce human-annotated multi-modal preference data.
Concrete Example: After LLaVA is fine-tuned on visual data, its score on the text-only MT-Bench drops to 5.92, well below its base model Vicuna-13B (6.57) and even below the smaller Vicuna-7B.
Key Novelty
Distillation-based Multi-modal Preference Alignment
  • Uses a strong multi-modal model (Gemini Pro) to generate fine-grained quality ratings (helpfulness, correctness, coherence) for responses generated by a weaker model (LLaVA).
  • Constructs a preference dataset in which the highest-rated response is 'chosen' and lower-rated ones are 'rejected', keeping only pairs with a clear quality gap.
  • Applies Direct Preference Optimization (DPO) to align the weaker model with these distilled preferences, bypassing the need for a separate reward model.
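The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `RatedResponse` type, the unweighted mean used to combine the three ratings, the `min_gap` threshold, and the `beta` value are all hypothetical choices made for the example. The loss function follows the standard per-pair DPO objective, which compares policy and reference log-probabilities directly and so needs no separate reward model.

```python
import math
from dataclasses import dataclass


@dataclass
class RatedResponse:
    """One candidate response with teacher-model ratings (hypothetical schema)."""
    text: str
    helpfulness: float
    correctness: float
    coherence: float

    @property
    def score(self) -> float:
        # Assumption: combine the three ratings with an unweighted mean.
        return (self.helpfulness + self.correctness + self.coherence) / 3.0


def build_preference_pairs(prompt, responses, min_gap=1.0):
    """Take the highest-scored response as 'chosen' and pair it with every
    response whose score is at least `min_gap` lower ('rejected').
    `min_gap` is an assumed stand-in for the 'clear quality gap' filter."""
    if len(responses) < 2:
        return []
    chosen = max(responses, key=lambda r: r.score)
    return [
        {"prompt": prompt, "chosen": chosen.text, "rejected": r.text}
        for r in responses
        if chosen.score - r.score >= min_gap
    ]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single pair: -log(sigmoid(margin)), where
    margin = beta * [(policy - reference) log-ratio of the chosen response
                     minus the same log-ratio of the rejected response]."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage: three candidate answers to one prompt, rated on an assumed 1-5 scale.
candidates = [
    RatedResponse("detailed, correct answer", 5, 5, 5),
    RatedResponse("mostly fine answer", 4, 4, 5),
    RatedResponse("off-topic answer", 2, 2, 2),
]
pairs = build_preference_pairs("Describe the image.", candidates)
```

In this usage example only the off-topic answer is paired against the top response; the middle answer is dropped because its score is too close to the best one, which is the intent of filtering for clear gaps.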
Evaluation Highlights
  • Surpasses the text instruction-following capability of the base language model (Vicuna) by reaching 6.73 on MT-Bench (vs. Vicuna's 6.57).
  • Achieves a +6% improvement on LLaVA-Bench and +4.9% on MM-Vet compared to the LLaVA baseline, showing gains in open-ended visual tasks.
  • Largely preserves visual knowledge performance (66.8 on MM-Bench), in contrast to the significant drops seen with prior RLHF approaches (60.1).
Breakthrough Assessment
7/10
Effective demonstration that DPO with AI-distilled feedback can fix modality degradation. While the method combines existing techniques (DPO + AI feedback), applying it to the specific problem of MLLM forgetting is valuable.