Multi-modal preference alignment remedies regression of visual instruction tuning on language model

📝 Paper Summary

Multi-modal Large Language Models (MLLMs) · Model Alignment

Applying Direct Preference Optimization on a small, AI-annotated multi-modal dataset restores the text instruction-following capabilities of MLLMs that are typically degraded during visual instruction tuning.

Core Problem

Visual Instruction Tuning (VIT) significantly degrades the pure language capabilities of Multi-modal LLMs because visual datasets lack the complexity and diversity of text-only instruction data.

Why it matters:

Models like LLaVA perform worse on text-only tasks than their base LLMs (e.g., Vicuna), creating a 'tax' for adding vision capabilities.
Production MLLMs need to handle interleaved image-text turns without losing the reasoning or coding abilities of the underlying language model.
Current alignment methods like RLHF are computationally expensive and rely on scarce human-annotated multi-modal preference data.

Concrete Example: After LLaVA is fine-tuned on visual data, its score on the text-only MT-Bench drops to 5.99, below both its base model Vicuna-13B (6.57) and the smaller Vicuna-7B.

Key Novelty

Distillation-based Multi-modal Preference Alignment

Uses a strong multi-modal model (Gemini Pro) to generate fine-grained quality ratings (helpfulness, correctness, coherence) for responses generated by a weaker model (LLaVA).
Constructs a preference dataset where the highest-rated response is 'chosen' and low-rated ones are 'rejected', filtering for clear quality gaps.
Applies Direct Preference Optimization (DPO) to align the weaker model with these distilled preferences, bypassing the need for a separate reward model.

Evaluation Highlights

Surpasses the text instruction-following capability of the base language model (Vicuna) by reaching 6.73 on MT-Bench (vs. Vicuna's 6.57).
Achieves a +6% improvement on LLaVA-Bench and +4.9% on MM-Vet compared to the LLaVA baseline, showing gains in open-ended visual tasks.
Keeps visual knowledge performance nearly intact (66.8 on MM-Bench), whereas prior RLHF approaches show a significant drop (60.1).

Breakthrough Assessment

7/10

Effective demonstration that DPO with AI-distilled feedback can fix modality degradation. While the method combines existing techniques (DPO + AI feedback), applying it to the specific problem of MLLM forgetting is valuable.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal instruction following and alignment

Inputs: Interleaved image and text instructions

Outputs: Textual response generated by the model

Pipeline Flow

Image/Text Input
Vision Encoder (CLIP/SigLIP)
Projection Layer
Language Model (Vicuna)

System Modules

Vision Encoder (Input Processing)

Encodes input images into visual feature embeddings

Model or implementation: CLIP-based (ViT-L/14-336px inferred from LLaVA-1.5 context)

Projector (Input Processing)

Maps visual embeddings into the LLM's token embedding space

Model or implementation: MLP (Multi-Layer Perceptron)

Language Model

Generates text response based on visual and text tokens

Model or implementation: LLaVA-1.5-13B (Vicuna-v1.5-13B backbone)
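
The data flow through these three modules can be illustrated with a toy sketch. Everything below is an assumption for illustration: the dimensions are tiny placeholders, the weights are random, the two-layer GELU projector is one plausible reading of "MLP", and placing the visual tokens before the text tokens is a simplification of interleaved input. It only shows how visual features end up in the same embedding space as text tokens.

```python
import math
import random

def linear(weights, bias, vec):
    """Apply y = W x + b for a (d_out x d_in) weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def make_layer(rng, d_in, d_out):
    """Random (weights, bias) pair standing in for trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    return weights, [0.0] * d_out

def project(patch_features, layer1, layer2):
    """Map each visual feature vector into the LLM token embedding space."""
    projected = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(*layer1, feat)]
        projected.append(linear(*layer2, hidden))
    return projected

def build_llm_input(visual_tokens, text_embeddings):
    """Concatenate projected visual tokens with text token embeddings."""
    return visual_tokens + text_embeddings

rng = random.Random(0)
D_VISION, D_LLM, N_PATCHES, N_TEXT = 8, 6, 4, 3  # toy sizes, not the real model's

# Stand-ins for the vision encoder output and the LLM's text embeddings.
patch_features = [[rng.gauss(0, 1) for _ in range(D_VISION)] for _ in range(N_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(D_LLM)] for _ in range(N_TEXT)]

layer1 = make_layer(rng, D_VISION, D_LLM)
layer2 = make_layer(rng, D_LLM, D_LLM)

visual_tokens = project(patch_features, layer1, layer2)
llm_input = build_llm_input(visual_tokens, text_embeddings)
print(len(llm_input), "tokens of width", len(llm_input[0]))
```

The language model then consumes `llm_input` as one sequence, which is why training on visual data can also shift its text-only behavior.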

Modeling

Base Model: LLaVA-1.5-13B (derived from Vicuna-1.5-13B)

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to prefer high-quality answers over low-quality ones based on implicit reward.

Formally: DPO loss L_DPO (optimizing log-likelihood ratios of chosen vs rejected responses)
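
As a rough illustration, the standard DPO objective for one preference pair is the negative log-sigmoid of beta times the difference between the chosen and rejected log-likelihood ratios (policy versus frozen reference model). The sketch below assumes the summed log-probabilities of each response are already available; the function and argument names are hypothetical and not taken from the paper's code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) computed as softplus(-margin), numerically stable.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# When the policy favors the chosen response more than the reference does, the loss falls.
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)
print(neutral, improved)
```

The implicit reward here is `beta * (policy_logp - ref_logp)`, which is how DPO avoids training a separate reward model.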

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA parameters (exact count not reported)
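
Since the exact count is not reported, the following sketch only shows how such a count would be computed for a single adapted weight matrix. The matrix width and the rank are illustrative assumptions, not values from the paper, and which layers receive adapters is also unknown.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns an update B @ A with A of shape (rank x d_in) and
    B of shape (d_out x rank); the original weight stays frozen.
    """
    return rank * d_in + d_out * rank

HIDDEN = 5120  # illustrative layer width (assumption)
RANK = 16      # hypothetical rank; the paper does not report it

per_matrix = lora_param_count(HIDDEN, HIDDEN, RANK)
full_matrix = HIDDEN * HIDDEN
fraction = per_matrix / full_matrix
print(per_matrix, full_matrix, fraction)
```

Under these assumed values, the adapter trains well under 1% of the parameters of the matrix it modifies, which is what makes DPO on a 13B model feasible on a small GPU budget.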

Training Data:

5,000 samples from SciGraphQA and LRV-Instruct
4 responses generated per prompt by LLaVA-1.5-13B (temperature 0.7)
Annotated by Gemini Pro Vision on 5 metrics (including helpfulness, correctness, and coherence)
Preference pairs selected: 'chosen' is the highest-scoring response; 'rejected' must score more than 2 points lower
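
The pair-selection rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-metric ratings are aggregated with a plain mean, the rating scale and the example values are made up, and the names `build_pairs` and `MIN_GAP` are hypothetical. The paper's actual aggregation may differ.

```python
from statistics import mean

MIN_GAP = 2.0  # 'rejected' must score more than 2 points below 'chosen'

def overall_score(ratings):
    """Aggregate per-metric ratings into one score (assumption: simple mean)."""
    return mean(ratings.values())

def build_pairs(prompt, responses, min_gap=MIN_GAP):
    """Return DPO preference pairs for one prompt.

    `responses` is a list of dicts with 'text' and 'ratings' keys.
    The top-scoring response is 'chosen'; every response that trails it
    by more than `min_gap` becomes a 'rejected' partner.
    """
    scored = [(overall_score(r["ratings"]), r["text"]) for r in responses]
    best_score, chosen = max(scored, key=lambda item: item[0])
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": text}
        for score, text in scored
        if best_score - score > min_gap
    ]

# Made-up ratings for four sampled responses to one prompt.
responses = [
    {"text": "A", "ratings": {"helpfulness": 9, "correctness": 9, "coherence": 8}},
    {"text": "B", "ratings": {"helpfulness": 7, "correctness": 8, "coherence": 8}},
    {"text": "C", "ratings": {"helpfulness": 5, "correctness": 4, "coherence": 7}},
    {"text": "D", "ratings": {"helpfulness": 6, "correctness": 6, "coherence": 7}},
]
pairs = build_pairs("Describe the trend in the chart.", responses)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

Response B is dropped because it trails the best response by only one point; filtering out such near-ties is what keeps the preference signal unambiguous.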

Key Hyperparameters:

beta: 0.1
learning_rate: 5e-5
batch_size: Not reported in the paper
lora_rank: Not reported in the paper
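
For anyone attempting a reproduction, the reported and unreported settings could be tracked in a small config object like the one below. The class name and structure are hypothetical; only `beta` and `learning_rate` carry values from the summary, and the unreported fields are deliberately left as `None` rather than guessed.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class DPOConfig:
    beta: float = 0.1                 # reported
    learning_rate: float = 5e-5       # reported
    batch_size: Optional[int] = None  # not reported in the paper
    lora_rank: Optional[int] = None   # not reported in the paper

    def missing_fields(self):
        """Names of settings that must be chosen before training can run."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

config = DPOConfig()
unreported = config.missing_fields()
print(unreported)
```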

Compute: 4 A100-80G GPUs on Azure Cloud

Comparison to Prior Work

vs. LLaVA-RLHF: DPO is more stable, computationally cheaper (no separate reward model), and avoids the significant drop on visual knowledge benchmarks seen with RLHF (MM-Bench: 60.1 for RLHF vs. 66.8 for DPO).
vs. SteerLM/Rejection Sampling: DPO explicitly optimizes preference margins rather than just imitating high-score samples, leading to better generalization on open-ended benchmarks.
vs. Silkie [not cited in paper]: Silkie also explores DPO for MLLMs but focuses on different datasets; this paper specifically targets the 'modality degradation' phenomenon.

Limitations

Relies on a commercial model (Gemini Pro) for annotations, introducing potential biases or dependency on API availability.
The approach was tested primarily on LLaVA-1.5-13B; scalability to larger or different architectures is not extensively verified.
Hyperparameter sensitivity: Performance is sensitive to the 'beta' parameter in DPO (0.1 worked best).

Reproducibility

Code availability is not explicitly provided in the paper snippet. The method relies on Gemini Pro (commercial API) for data annotation, which is a closed-source dependency. Hyperparameters for DPO beta and learning rate are provided.

📊 Experiments & Results

Evaluation Setup

Evaluation on both pure text instruction following and multi-modal capabilities.

Benchmarks:

MT-Bench (Text-only multi-turn instruction following)
MM-Vet (Integrated visual-language capabilities)
LLaVA-Bench (Visual instruction following in the wild)
MM-Bench (Visual knowledge and reasoning, multiple choice)
POPE (Object hallucination detection)

Metrics:

GPT-4-scored rating (0-10 or 0-100)
Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MT-Bench	Score (0-10)	5.99	6.73	+0.74
MM-Bench	Accuracy (%)	60.1	66.8	+6.7
LLaVA-Bench	Score	Not reported in the paper	77.4	Not reported in the paper

MT-Bench: DPO alignment significantly improves text instruction following capabilities compared to the standard visual instruction tuned baseline.
MM-Bench: DPO maintains strong performance on visual knowledge benchmarks where previous RLHF methods caused significant regression.
LLaVA-Bench: Open-ended visual instruction following is improved through preference alignment.

Main Takeaways

Visual instruction tuning imposes a 'tax' on language models, degrading their pure text instruction-following abilities (e.g., LLaVA performs worse than Vicuna on MT-Bench).
Direct Preference Optimization (DPO) utilizing AI-generated preferences (from Gemini) effectively reverses this degradation, surpassing even the original base model's text performance.
DPO incurs a much lower 'alignment tax' on visual knowledge tasks (MM-Bench) compared to traditional RLHF approaches.
Rejection Sampling is more effective than DPO specifically on multiple-choice visual benchmarks (POPE, MM-Bench), while DPO dominates in open-ended generation tasks.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Visual Instruction Tuning (VIT)
Familiarity with Reinforcement Learning from Human Feedback (RLHF) concepts
Knowledge of DPO (Direct Preference Optimization)

Key Terms

MLLM: Multi-modal Large Language Model—an LLM adapted to process non-text inputs like images alongside text

DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly on preference pairs without training an explicit reward model

Visual Instruction Tuning: Fine-tuning an LLM on pairs of images and instructions to enable visual understanding

Catastrophic Forgetting: The phenomenon where a model forgets previously learned information (e.g., text skills) when trained on new data (e.g., visual tasks)

SFT: Supervised Fine-Tuning—training a model to mimic reference answers

RLHF: Reinforcement Learning from Human Feedback—aligning models using a reward model trained on human preferences

SteerLM: A method that conditions model generation on specified attribute scores (e.g., helpfulness: 5) during training and inference

Modality Conflict: Interference between different data types (text vs. image) during training that harms performance in one or both domains

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank update matrices added to them

Chain-of-Thought: Prompting the model to generate intermediate reasoning steps before the final answer