Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Preference Optimization, Reasoning

The paper introduces a scalable pipeline for generating multimodal reasoning preference data and a Mixed Preference Optimization method that combines relative preference, absolute quality, and generation losses to improve MLLM Chain-of-Thought performance.

Core Problem

Open-source MLLMs suffer from distribution shifts during Chain-of-Thought (CoT) reasoning, often performing worse with CoT than with direct answers due to the disconnect between teacher-forced training and autoregressive inference.

Why it matters:

CoT is crucial for complex reasoning, but current SFT methods degrade performance on MLLMs (e.g., InternVL2-8B drops from 58.3 to 56.8 on MathVista with CoT)
Existing multimodal preference datasets focus on hallucination reduction in natural images, lacking scientific/reasoning data
Annotating reasoning processes for multimodal data is prohibitively expensive and time-consuming

Concrete Example: On MathVista, InternVL2-8B scores 58.3 with direct answers but drops to 56.8 when forced to use Chain-of-Thought. The model fails to maintain coherent long-context reasoning because the SFT loss does not account for the distribution shift between training (teacher forcing) and inference (autoregressive generation).

Key Novelty

Mixed Preference Optimization (MPO) and Scalable Preference Data Pipeline (MMPR)

Creates a large-scale preference dataset (MMPR) with an automated pipeline; its 'Dropout Next Token Prediction' (DropoutNTP) step generates negative reasoning samples by truncating a response and having the model complete it without access to the image
Proposes MPO, a training objective combining DPO (relative preference), BCO (absolute quality), and SFT (generation capability) to align models with high-quality reasoning paths without a reward model
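The DropoutNTP idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `dropout_ntp_pair`, the `complete_without_image` callable (a stand-in for a model call that withholds the image), and the `keep_ratio` truncation point are all assumptions introduced here.

```python
from typing import Callable, Dict


def dropout_ntp_pair(
    question: str,
    chosen: str,
    complete_without_image: Callable[[str, str], str],
    keep_ratio: float = 0.5,  # assumed truncation point, not taken from the paper
) -> Dict[str, str]:
    """Build one (chosen, rejected) pair by truncating and re-completing a response."""
    words = chosen.split()
    # Keep at least one word and, where possible, leave at least one to regenerate.
    cut = max(1, min(len(words) - 1, int(len(words) * keep_ratio)))
    prefix = " ".join(words[:cut])
    # Hypothetical model call: continue the prefix with the image withheld.
    continuation = complete_without_image(question, prefix)
    rejected = (prefix + " " + continuation).strip()
    return {"question": question, "chosen": chosen, "rejected": rejected}


def fake_model(question: str, prefix: str) -> str:
    """Stub standing in for an image-free completion; returns a fixed string."""
    return "red car parked outside."


pair = dropout_ntp_pair(
    "What is in the image?",
    "The image shows a blue bicycle leaning on a wall.",
    fake_model,
)
```

The original response is kept as the chosen sample, and the rejected sample shares its prefix, so the two differ only in the part written without visual input.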

Architecture

The automated preference data construction pipeline showing two paths: Correctness-based for ground truth data and Dropout NTP for open-ended data.
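The correctness-based path can be illustrated with a short sketch. It assumes that sampled responses whose final answer matches the ground truth are treated as chosen and the rest as rejected. The `Final answer:` marker, the string normalization, and the function names are hypothetical choices made for this example only.

```python
from itertools import product
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase and strip whitespace and a trailing period before comparison."""
    return text.strip().lower().rstrip(".").strip()


def extract_answer(response: str, marker: str = "Final answer:") -> str:
    """Return the normalized text after the last marker, or '' if it is absent."""
    if marker not in response:
        return ""
    return normalize(response.rsplit(marker, 1)[1])


def correctness_pairs(responses: List[str], ground_truth: str) -> List[Tuple[str, str]]:
    """Pair every correct response (chosen) with every incorrect one (rejected)."""
    gt = normalize(ground_truth)
    correct = [r for r in responses if extract_answer(r) == gt]
    wrong = [r for r in responses if extract_answer(r) != gt]
    return list(product(correct, wrong))


samples = [
    "Two plus two is four. Final answer: 4",
    "I count five objects. Final answer: 5",
    "The figure is unclear.",
]
pairs = correctness_pairs(samples, "4")
```

Exact-match checking of this kind only works when answers are short and verifiable, which is consistent with the limitation noted later for general VQA and document sources.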

Evaluation Highlights

+8.7 point accuracy improvement on MathVista for InternVL2-8B-MPO (67.0) compared to the base InternVL2-8B model (58.3)
InternVL2-8B-MPO achieves performance comparable to the roughly 10x larger InternVL2-76B on MathVista
Data construction pipeline reduces token cost to 57.5% of the RLAIF-V divide-and-conquer method while maintaining effectiveness

Breakthrough Assessment

8/10

Significant performance gains on hard reasoning benchmarks (MathVista) and a clever, scalable data construction method (Dropout NTP) that addresses the bottleneck of multimodal preference data scarcity.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Chain-of-Thought (CoT) reasoning optimization

Inputs: Image I and instruction x

Outputs: Reasoning chain followed by final answer y

Pipeline Flow

Data Construction (DropoutNTP & Correctness-based)
Training (Mixed Preference Optimization)
Inference (CoT Generation)

System Modules

Preference Data Generator

Generates pairs of chosen (y_c) and rejected (y_r) responses

Model or implementation: InternVL2 series (M_0)

Policy Model

Multimodal language model being optimized

Model or implementation: InternVL2-8B and InternVL2-76B

Novel Architectural Elements

MPO Loss function: A weighted sum of DPO (relative preference), BCO (absolute quality), and SFT (generation/stability) losses applied simultaneously

Modeling

Base Model: InternVL2-8B and InternVL2-76B

Training Method: Mixed Preference Optimization (MPO)

Objective Functions:

Preference loss (DPO)

Purpose: Learn relative preference between responses.

Formally: L_DPO = -E[log σ(β log(π_θ(y_c|x)/π_ref(y_c|x)) - β log(π_θ(y_r|x)/π_ref(y_r|x)))]

Quality loss (BCO)

Purpose: Learn absolute quality of individual responses using binary classification loss.

Formally: L_BCO = L_q^+ + L_q^-, where L_q^+ = -log σ(β log(π_θ(y_c|x)/π_ref(y_c|x)) - δ) and L_q^- = -log(1 - σ(β log(π_θ(y_r|x)/π_ref(y_r|x)) - δ))

Generation loss (SFT)

Purpose: Maintain generation capability and stabilize training (SFT).

Formally: L_SFT = -E[log π_θ(y_c|x)]

Combined objective: L_MPO = w_p L_DPO + w_q L_BCO + w_g L_SFT
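As a rough numerical sketch of how the three terms combine for a single example, the snippet below works directly on sequence log-probabilities. The function names, the values of `beta` and `delta`, and the weights `w_p`, `w_q`, `w_g` are placeholders rather than the paper's settings. Using the policy-to-reference log-ratio as the reward in the BCO term, and taking the SFT term on the chosen response, are assumptions of this sketch.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def mpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r,
             beta=0.1, delta=0.0, w_p=1.0, w_q=1.0, w_g=1.0):
    """Per-example mixed loss from sequence log-probabilities.

    beta, delta and the weights are placeholder values, not the paper's settings.
    """
    # Implicit rewards: scaled log-ratio of policy to reference (assumed for BCO too).
    reward_c = beta * (logp_c - ref_logp_c)
    reward_r = beta * (logp_r - ref_logp_r)

    # Relative preference: chosen should out-score rejected.
    l_dpo = -log_sigmoid(reward_c - reward_r)
    # Absolute quality: log(1 - sigmoid(z)) equals log_sigmoid(-z).
    l_bco = -log_sigmoid(reward_c - delta) - log_sigmoid(-(reward_r - delta))
    # Generation: negative log-likelihood of the chosen response (assumption).
    l_sft = -logp_c

    total = w_p * l_dpo + w_q * l_bco + w_g * l_sft
    return {"dpo": l_dpo, "bco": l_bco, "sft": l_sft, "total": total}


at_reference = mpo_loss(-10.0, -12.0, -10.0, -12.0)
improved = mpo_loss(-8.0, -12.0, -10.0, -12.0)
```

When the policy equals the reference, both rewards are zero, so the DPO term is log 2 and the BCO term is 2 log 2. Raising the chosen log-probability lowers all three terms.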

Training Data:

MMPR dataset: ~3 million samples total
750K samples without clear ground truth (DropoutNTP)
2.5M samples with clear ground truth (Correctness-based)
Sources include VQAv2, ScienceQA, MathVista, OCR, etc.

Key Hyperparameters:

beta (KL penalty): Not explicitly reported in the paper
delta (reward shift): Moving average of previous rewards
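One way to maintain a moving average of past rewards for the shift delta is sketched below. The summary does not say which kind of moving average is used, so the exponential form, the `momentum` value, and the class name `RewardShift` are assumptions made for illustration.

```python
from typing import Sequence


class RewardShift:
    """Tracks delta as an exponential moving average of batch-mean rewards."""

    def __init__(self, momentum: float = 0.9):  # assumed value, not from the paper
        self.momentum = momentum
        self.delta = 0.0
        self.initialized = False

    def update(self, batch_rewards: Sequence[float]) -> float:
        """Fold the mean reward of a new batch into delta and return it."""
        batch_mean = sum(batch_rewards) / len(batch_rewards)
        if not self.initialized:
            # The first batch sets delta directly so it does not start biased toward 0.
            self.delta = batch_mean
            self.initialized = True
        else:
            self.delta = self.momentum * self.delta + (1 - self.momentum) * batch_mean
        return self.delta


shift = RewardShift(momentum=0.5)
first = shift.update([1.0, 3.0])
second = shift.update([4.0])
```

Centering the rewards this way keeps the binary classification threshold near the typical reward level as training progresses.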

Compute: Not reported in the paper

Comparison to Prior Work

vs. RLAIF-V: MPO's data pipeline (DropoutNTP) is 42.5% cheaper (57.5% of cost) while achieving comparable hallucination reduction
vs. Standard DPO: MPO adds BCO (quality) and SFT (generation) losses to prevent degradation of reasoning chains and repetition issues observed with pure DPO
vs. LLaVA-RLHF [not cited in paper]: Focuses specifically on reasoning/CoT rather than general alignment or hallucination

Limitations

Correctness-based data pipeline excludes general VQA and document sources due to difficulty in heuristic verification (false negatives)
DropoutNTP relies on the assumption that removing image context always degrades quality, which generally holds but is heuristic
Specific hyperparameters for MPO (weights w*) are not detailed in the text

Reproducibility

Code: https://github.com/OpenGVLab/InternVL

Code, data, and model are released. The paper details the data sources and the exact logic for DropoutNTP and Correctness-based data construction.

📊 Experiments & Results

Evaluation Setup

Multimodal reasoning and hallucination benchmarks

Benchmarks:

MathVista (Multimodal mathematical reasoning)
M3CoT (Multimodal Chain-of-Thought)
HallusionBench (Visual hallucination evaluation)
MMBench (General multimodal evaluation)

Metrics:

Accuracy
F1 Score

Statistical methodology: Not explicitly reported in the paper

Key Results

MathVista results show large improvements for the 8B model, closing the gap to the 76B model.

Benchmark	Metric	Baseline	This Paper	Δ
MathVista	Accuracy	66.5	67.0	+0.5

Hallucination benchmarks confirm that improving reasoning does not come at the cost of increased hallucinations; in fact, it reduces them.

Comparison of data construction costs showing efficiency gains.

Experiment Figures

Performance comparison on MathVista between Direct Answer and Chain-of-Thought (CoT) for base models vs. MPO models.

Main Takeaways

SFT often degrades CoT performance in MLLMs due to distribution shift; Preference Optimization (PO) reverses this trend.
Pure DPO can lead to repetitive responses or failed rationales in MLLMs; mixing BCO (quality) and SFT (generation) losses (MPO) stabilizes training.
The 'Dropout NTP' method effectively generates negative samples for open-ended tasks by removing visual context, creating a scalable path for multimodal preference data without human annotation.
The approach scales: improvements are observed in both 8B and 76B parameter models.

📚 Prerequisite Knowledge

Prerequisites

Direct Preference Optimization (DPO)
Supervised Fine-Tuning (SFT)
Chain-of-Thought (CoT) prompting
Multimodal Large Language Models (architecture)

Key Terms

MPO: Mixed Preference Optimization—a method combining preference loss, quality loss, and generation loss

DPO: Direct Preference Optimization—optimizing a policy to satisfy preferences without an explicit reward model

BCO: Binary Classifier Optimization—a quality loss method treating the policy as a binary classifier to distinguish absolute quality of responses

DropoutNTP: Dropout Next Token Prediction—a data construction method where negative samples are generated by truncating a good response and asking the model to complete it without looking at the image (inducing hallucination)

SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs using standard next-token prediction

CoT: Chain-of-Thought—prompting the model to generate intermediate reasoning steps before the final answer

MMPR: MultiModal PReference dataset—the large-scale dataset constructed in this paper

KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
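For intuition, the KL divergence between two small discrete distributions can be computed directly. This is a generic illustration of the definition, not code from the paper; in the preference losses, beta controls how strongly the policy is kept close to the reference in this sense.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions over the same support, in nats."""
    # Terms with p_i == 0 contribute nothing; q_i must be positive wherever p_i is.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


policy = [0.5, 0.5]
reference = [0.9, 0.1]

forward = kl_divergence(policy, reference)
reverse = kl_divergence(reference, policy)
same = kl_divergence(policy, policy)
```

The divergence is zero only when the two distributions match, and it is not symmetric, so `forward` and `reverse` generally differ.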