National University of Singapore,
Fudan University,
Harbin Institute of Technology,
Eastern Institute of Technology
arXiv
(2025)
MM, RL, Benchmark
📝 Paper Summary
Alt-text Generation, Multimodal Alignment
MCM-DPO improves alt-text generation by extending Direct Preference Optimization to include visual and contextual preference pairs, rather than optimizing solely on text responses.
Core Problem
Existing MLLMs produce verbose captions rather than concise alt-text, and supervised fine-tuning fails due to noisy, inconsistent user-generated annotations on social media.
Why it matters:
2.2 billion people have visual impairments, yet up to 98% of Twitter images lack alt-text, limiting access to digital content
Standard MLLMs are biased toward image captioning (detailed descriptions) rather than the context-aware, functional summaries required for alt-text
Reliance on Supervised Fine-Tuning (SFT) limits performance because manually cleaning large-scale noisy alt-text data is labor-intensive
Concrete Example: A standard image captioning model might describe a Twitter image in excessive visual detail (e.g., 'A person standing next to a tree wearing a blue shirt...'), whereas a blind user needs concise, context-aware alt-text that is relevant to the text of the tweet.
Key Novelty
Multifaceted Cross-Modal Direct Preference Optimization (MCM-DPO)
Extends standard DPO by constructing preference pairs not just for the text response, but also for the image (visual preference) and the surrounding context (contextual preference)
Optimizes across seven distinct dimensions: single preferences (Response, Image, Context), pairwise combinations (e.g., Image+Response), and multi-preference (all three combined) to align modalities
Utilizes negative samples for images (e.g., rotated images) and contexts (randomly swapped contexts) to teach the model to distinguish correct cross-modal alignment
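The negative-sample idea above can be illustrated with a minimal sketch. It assumes images are plain 2D grids and that each sample is a dict with `image` and `context` keys. The helper names (`rotate90`, `swap_contexts`, `build_pairs`) and the cyclic-shift swap are illustrative choices, not the paper's implementation.

```python
import random

def rotate90(image):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def swap_contexts(contexts, seed=0):
    """Give every sample a context taken from a different sample.

    A non-zero cyclic shift guarantees that no sample keeps its own
    context (by index; duplicate texts are not detected here).
    """
    n = len(contexts)
    if n < 2:
        raise ValueError("need at least two contexts to swap")
    shift = random.Random(seed).randrange(1, n)
    return [contexts[(i + shift) % n] for i in range(n)]

def build_pairs(samples, seed=0):
    """Attach a rejected image and a rejected context to each sample."""
    bad_contexts = swap_contexts([s["context"] for s in samples], seed)
    pairs = []
    for sample, bad_context in zip(samples, bad_contexts):
        pairs.append({
            "image_chosen": sample["image"],
            "image_rejected": rotate90(sample["image"]),  # hypothetical corruption
            "context_chosen": sample["context"],
            "context_rejected": bad_context,
        })
    return pairs
```

In a real pipeline the rotation would be applied to image tensors or files rather than nested lists; the grid form is used here only to keep the sketch dependency-free.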
Architecture
The MCM-DPO framework showing the seven preference optimization dimensions derived from image, context, and response combinations.
Evaluation Highlights
Constructed TAlt and PAlt datasets comprising 202k annotated alt-text samples and 18k preference pairs from Twitter and Pinterest
MCM-DPO consistently outperforms both standard DPO and SFT baselines on the TAlt and PAlt benchmarks (a qualitative claim from the abstract; specific numbers are not in the provided text)
Establishes new state-of-the-art performance for alt-text generation while reducing reliance on accurate target annotations
Breakthrough Assessment
7/10
Novel extension of DPO to non-text modalities (image/context inputs) addresses a specific, high-impact accessibility problem. The large-scale dataset contribution is significant.
⚙️ Technical Details
Problem Definition
Setting: Multimodal generation where the model produces a text description y given an image m, context c, and prompt x
Inputs: Prompt x, Image m, Context c (post text)
Outputs: Alt-text response y
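For reference, the standard DPO objective can be written in the notation above. The symbols $y_w$ and $y_l$ (preferred and rejected responses), $\pi_\theta$ (the policy), $\pi_{\mathrm{ref}}$ (the frozen reference model), $\sigma$ (the logistic function), and $\beta$ (a scaling coefficient) are not defined in this summary and are introduced here. The second equation is an assumed image-preference analogue in which the response is fixed and the image varies between a preferred $m_w$ and a rejected $m_l$; the paper's exact formulation may differ.

```latex
\mathcal{L}_{\text{response}}
  = -\,\mathbb{E}\left[\log \sigma\left(
      \beta \log \frac{\pi_\theta(y_w \mid x, m, c)}{\pi_{\mathrm{ref}}(y_w \mid x, m, c)}
    - \beta \log \frac{\pi_\theta(y_l \mid x, m, c)}{\pi_{\mathrm{ref}}(y_l \mid x, m, c)}
  \right)\right]

\mathcal{L}_{\text{image}}
  = -\,\mathbb{E}\left[\log \sigma\left(
      \beta \log \frac{\pi_\theta(y \mid x, m_w, c)}{\pi_{\mathrm{ref}}(y \mid x, m_w, c)}
    - \beta \log \frac{\pi_\theta(y \mid x, m_l, c)}{\pi_{\mathrm{ref}}(y \mid x, m_l, c)}
  \right)\right]
```

A context-preference term would follow the same pattern, with $c_w$ and $c_l$ in place of $m_w$ and $m_l$.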
Pipeline Flow
Vision Encoder (Processes Image)
Projection Layer
Large Language Model (Processes Context + Image Features)
Generation (Produces Alt-text)
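The data flow through these four stages can be sketched with toy stand-ins. Every function here is a hypothetical placeholder: the real vision encoder and LLM are not specified in the provided text, so this shows only how the outputs of one stage feed the next.

```python
def vision_encoder(image):
    """Toy stand-in: one feature per image row (the row mean)."""
    return [sum(row) / len(row) for row in image]

def projection(features, weights):
    """Linear map from vision features into the LLM embedding space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def language_model(prompt, context, image_embedding):
    """Toy stand-in for the LLM: reports what it was conditioned on."""
    return f"{prompt} | context={context!r} | image_dims={len(image_embedding)}"

def generate_alt_text(prompt, image, context, weights):
    """Run the pipeline: encode, project, then condition the LLM."""
    features = vision_encoder(image)
    embedding = projection(features, weights)
    return language_model(prompt, context, embedding)
```

The point of the projection step is that vision features and text tokens live in different spaces; the linear map here is the simplest possible bridge and is an assumption, not a detail from the paper.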
System Modules
Vision Encoder
Extract visual features from the input image
Model or implementation: Unspecified in the provided text (likely CLIP- or SigLIP-based)
Large Language Model
Generate alt-text based on visual features and textual context
Model or implementation: Unspecified MLLM backbone
Novel Architectural Elements
The optimization objective includes seven distinct loss components covering the single, pairwise, and multi-preference dimensions (RPO, VPO, CPO, VRPO, CRPO, VCPO, MTPO)
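The count of seven follows from taking every non-empty subset of the three facets (response R, image V, context C). The sketch below enumerates them and pairs them with a standard DPO loss on log-probabilities. The mapping from subsets to names is inferred from the acronyms, and the weighted sum in `combined_loss` is an assumption, since the summary does not say how the components are combined.

```python
import math
from itertools import combinations

# Subset-to-name mapping inferred from the acronyms in the summary.
NAMES = {
    ("R",): "RPO", ("V",): "VPO", ("C",): "CPO",
    ("R", "V"): "VRPO", ("C", "R"): "CRPO", ("C", "V"): "VCPO",
    ("C", "R", "V"): "MTPO",
}

def preference_dimensions():
    """List the loss names for all non-empty subsets of {C, R, V}."""
    facets = ("C", "R", "V")  # sorted so tuples match the NAMES keys
    return [NAMES[combo]
            for k in (1, 2, 3)
            for combo in combinations(facets, k)]

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), for one preference pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def combined_loss(losses, weights=None):
    """Weighted sum of per-dimension losses (weights default to 1.0).

    Assumption: the paper's actual weighting scheme is not given here.
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())
```

Each dimension would call `dpo_loss` with log-probabilities computed under its own chosen and rejected inputs, for example the original versus the rotated image for VPO.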
Modeling
Base Model: Unspecified MLLM (the provided text does not name the exact backbone, e.g., LLaVA or Vicuna)
Training Method: Two-stage training: (1) SFT on large-scale noisy data, (2) MCM-DPO on high-quality preference pairs