OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

📝 Paper Summary

Multi-Modal Large Language Models (MLLMs) · Human Preference Alignment

OmniAlign-V enhances multi-modal alignment by training models on a rigorously filtered, semantically rich dataset of 200K open-ended questions, bridging the gap between foundational capabilities and human preferences.

Core Problem

Open-source MLLMs excel at basic tasks (OCR, detection) but fail to align with human preferences in open-ended conversations, often producing short, unhelpful, or hallucinatory responses.

Why it matters:

Mixing high-quality text-only alignment data into MLLM training fails to improve multi-modal alignment and actively degrades foundational visual skills
Existing visual instruction datasets focus on short, factual QA, lacking the complexity, creativity, and length required for satisfying human-AI interaction
Current benchmarks prioritize objective accuracy over subjective helpfulness and user preference

Concrete Example: When asked an open-ended question about an image, a standard MLLM might give a brief, robotic description. Adding text-only alignment data does not fix this: in preliminary studies it improved text-only responses, but it lowered scores on visual benchmarks such as MMMU (e.g., -1.2 points) and did not improve visual alignment scores.

Key Novelty

Semantic Richness Filtering & Hybrid Task Taxonomy

Filters natural images not just by complexity (pixel randomness) but by semantic richness (object detection counts), rejecting chaotic but empty images (e.g., a field of identical tents)
Splits data generation into distinct pipelines for Natural images (Knowledge, Creative, Inferential) and Infographics (Charts, Posters), applying specialized refinement strategies like merging OCR results with LLM reasoning
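The two-stage image filter described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `ImageRecord` fields, the use of distinct tags as the richness signal, and both thresholds are assumptions made for the example. The scores and tags would come from an image-complexity model and an object-tagging model respectively.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageRecord:
    path: str
    complexity: float  # hypothetical complexity score in [0, 1]
    tags: List[str]    # hypothetical object tags, one entry per detected object


def is_semantically_rich(rec: ImageRecord,
                         min_complexity: float = 0.5,
                         min_distinct_tags: int = 5) -> bool:
    """Keep an image only if it is visually complex AND contains enough distinct concepts.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if rec.complexity < min_complexity:
        return False
    return len(set(rec.tags)) >= min_distinct_tags


images = [
    # Visually busy, but only one concept repeated many times: rejected.
    ImageRecord("tents.jpg", 0.82, ["tent"] * 40),
    # Busy and varied: kept.
    ImageRecord("market.jpg", 0.77, ["person", "fruit", "stall", "sign", "bicycle", "awning"]),
    # Too simple to pass the complexity stage: rejected.
    ImageRecord("wall.jpg", 0.10, ["wall"]),
]

kept = [rec.path for rec in images if is_semantically_rich(rec)]
print(kept)  # ['market.jpg']
```

Counting distinct tags rather than raw detections is what rejects the "field of identical tents" case: many detections, one concept.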

Architecture

The data synthesis pipeline for OmniAlign-V, detailing image selection and Question-Answer generation/refinement.

Evaluation Highlights

Achieves a 28.5% win rate on MM-AlignBench with a Qwen2.5-32B backbone, outperforming the much larger Qwen2-VL-72B-Instruct (25.1%), which was tuned on proprietary data
+13.6 point improvement on WildVision Score (alignment benchmark) when fine-tuning InternLM2.5-7B with OmniAlign-V compared to the LLaVA-Next-778K baseline
Maintains or improves foundational capabilities, gaining +1.6 points on MMMU (37.2 to 38.8) while simultaneously improving alignment metrics

Breakthrough Assessment

8/10

Strong contribution to data-centric AI for MLLMs. Addresses a critical alignment gap with a reproducible pipeline and high-quality artifacts (dataset + benchmark), showing clear gains over standard baselines.

⚙️ Technical Details

Problem Definition

Setting: Multi-Modal Instruction Tuning and Alignment

Inputs: Image I and natural language instruction X_q

Outputs: Natural language response X_a aligned with human preferences

Pipeline Flow

Vision Encoder (processes image)
Projector/Connector (maps vision features to text space)
LLM Backbone (generates response)

System Modules

Vision Encoder (Input Processing)

Extract visual features from input images

Model or implementation: CLIP-ViT-L/14-336px (for InternLM2.5) or SigLIP-SO400M (for Qwen2.5)

Projector (Input Processing)

Align visual features with the LLM's token embedding space

Model or implementation: MLP (Multi-Layer Perceptron)

LLM Backbone

Generate text response based on visual tokens and text instruction

Model or implementation: InternLM2.5-7B or Qwen2.5-32B
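The way the three modules connect can be illustrated with a toy, dependency-free sketch. Everything here is a stand-in: random matrices replace the real vision encoder output and text embeddings, the dimensions are tiny, and the two-layer GELU projector is an assumption about the MLP's shape rather than a detail confirmed by the summary. The point is only the data flow: vision features are projected to the LLM's embedding width and concatenated with the instruction embeddings.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def make_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random matrix used as a stand-in for real weights or features."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(tokens: Matrix, weight: Matrix) -> Matrix:
    """Multiply [n, d_in] tokens by a [d_in, d_out] weight (bias omitted for brevity)."""
    d_in, d_out = len(weight), len(weight[0])
    return [[sum(tok[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
            for tok in tokens]


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def project(vision_tokens: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Two-layer MLP projector: vision feature space -> LLM embedding space."""
    hidden = [[gelu(v) for v in row] for row in linear(vision_tokens, w1)]
    return linear(hidden, w2)


rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES, NUM_TEXT_TOKENS = 8, 16, 4, 3  # toy sizes

vision_tokens = make_matrix(NUM_PATCHES, VISION_DIM, rng)     # stand-in for encoder output
text_embeddings = make_matrix(NUM_TEXT_TOKENS, LLM_DIM, rng)  # stand-in for embedded instruction

w1 = make_matrix(VISION_DIM, LLM_DIM, rng)
w2 = make_matrix(LLM_DIM, LLM_DIM, rng)

visual_embeddings = project(vision_tokens, w1, w2)
llm_inputs = visual_embeddings + text_embeddings  # sequence the LLM backbone would consume

print(len(llm_inputs), len(llm_inputs[0]))  # 7 16
```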

Modeling

Base Model: LLaVA-Next architecture with InternLM2.5-7B or Qwen2.5-32B backbones

Training Method: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)

Objective Functions:

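Purpose: Maximize the likelihood of generating the correct next token given the context (SFT).

Formally: Standard autoregressive language modeling loss (cross-entropy).

Purpose: Increase the probability of the preferred response relative to the rejected response (DPO).

Formally: $\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$, where $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_\theta$ is the policy being trained, and $\pi_{\mathrm{ref}}$ is the frozen reference model.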
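To make the DPO objective concrete, here is a minimal per-example sketch in plain Python. It assumes the four inputs are sequence-level log-probabilities (summed over response tokens) that have already been computed by the policy and the reference model; the function and argument names are illustrative, not taken from the paper's code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (preferred, rejected) pair.

    *_logp_w / *_logp_l are log pi(y_w|x) and log pi(y_l|x) under the
    policy and the frozen reference model.
    """
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the preferred response's probability and lowering the rejected one's reduces the loss.
loss_after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(loss_at_init, loss_after_update)
```

The loss depends only on how far the policy has moved from the reference on each response, which is why no separate reward model is needed.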

Adaptation: Full fine-tuning of the LLM backbone and projector

Training Data:

OmniAlign-V: 200K total samples
Knowledge QAs: 39K
Inferential QAs: 37K
Creative QAs: 10K
Instruction-Following QAs: 38K
Infographic QAs: 44K
Detail QAs: 35K
DPO Data: Generated via rejection sampling, using the LLaVA-Next baseline as the generator and an LLM as the judge
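One way to build preference pairs with a generator and a judge is sketched below. The summary does not specify how the preferred and rejected responses are selected, so this sketch assumes the highest- and lowest-scored samples are used and that pairs with too small a score gap are dropped. `generate`, `judge_score`, `num_samples`, and `min_gap` are hypothetical names, and the toy generator and judge exist only to make the example runnable.

```python
from typing import Callable, Dict, Optional


def build_preference_pair(question: str,
                          generate: Callable[[str], str],
                          judge_score: Callable[[str, str], float],
                          num_samples: int = 4,
                          min_gap: float = 1.0) -> Optional[Dict[str, str]]:
    """Sample several responses, score them, and keep the best/worst as a pair.

    Returns None when the judge cannot clearly separate the candidates.
    """
    candidates = [generate(question) for _ in range(num_samples)]
    scored = sorted(((judge_score(question, c), c) for c in candidates),
                    key=lambda pair: pair[0])
    low_score, rejected = scored[0]
    high_score, chosen = scored[-1]
    if high_score - low_score < min_gap:
        return None
    return {"prompt": question, "chosen": chosen, "rejected": rejected}


# Toy stand-ins: a canned "generator" and a judge that simply prefers longer answers.
canned = iter([
    "A cat.",
    "A tabby cat naps on a sunlit windowsill beside a potted fern.",
    "Cat on sill.",
    "An animal.",
])
pair = build_preference_pair(
    "Describe the image.",
    generate=lambda q: next(canned),
    judge_score=lambda q, a: float(len(a.split())),
)
print(pair["chosen"])
print(pair["rejected"])
```

In a real pipeline the generator would be a sampled MLLM and the judge an LLM prompted with a scoring rubric; a length-based judge is used here purely as a placeholder.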

Key Hyperparameters:

learning_rate: 2e-5 (InternLM), 1e-5 (Qwen)
batch_size: 128
epochs: 1
max_length: 4096 (InternLM) / 32768 (Qwen)
beta_dpo: 0.1

Compute: 8x A100 GPUs for InternLM2.5-7B training; 32x A100 GPUs for Qwen2.5-32B training
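The hyperparameters and compute listed above can be collected into a small config object, sketched here. Treating `batch_size` as the global batch size is an assumption, as are the per-device batch sizes used in the usage lines; the summary does not report either detail. The helper just shows the arithmetic relating global batch size, GPU count, and gradient accumulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    backbone: str
    learning_rate: float
    max_length: int
    num_gpus: int
    batch_size: int = 128  # assumed to be the global batch size
    epochs: int = 1
    beta_dpo: float = 0.1  # only relevant for the DPO stage

    def grad_accum_steps(self, per_device_batch: int) -> int:
        """Gradient accumulation steps needed to reach the global batch size."""
        per_step = self.num_gpus * per_device_batch
        if self.batch_size % per_step != 0:
            raise ValueError("batch_size must be divisible by num_gpus * per_device_batch")
        return self.batch_size // per_step


INTERNLM = TrainConfig("InternLM2.5-7B", learning_rate=2e-5, max_length=4096, num_gpus=8)
QWEN = TrainConfig("Qwen2.5-32B", learning_rate=1e-5, max_length=32768, num_gpus=32)

# The per_device_batch values below are illustrative assumptions.
print(INTERNLM.grad_accum_steps(per_device_batch=4))  # 128 / (8 * 4) = 4
print(QWEN.grad_accum_steps(per_device_batch=1))      # 128 / (32 * 1) = 4
```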

Comparison to Prior Work

vs. LLaVA-Next-778K: OmniAlign-V uses stricter semantic filtering and focuses on open-ended/complex alignment rather than simple VQA
vs. Qwen2-VL: Achieves comparable or better alignment with significantly less training data (200K vs massive proprietary sets) and smaller model size (32B vs 72B)
vs. RLHF-V [not cited in paper]: OmniAlign-V uses DPO with synthetic preference pairs derived from rejection sampling rather than human feedback or PPO

Limitations

Heavy reliance on GPT-4o for data synthesis limits scalability and introduces potential bias from the teacher model
Evaluation primarily focuses on English language tasks
No direct human evaluation of the training set quality, though the benchmark is human-annotated

Reproducibility

Code: https://github.com/PhoenixZ810/OmniAlign-V

Code, datasets, and checkpoints are publicly available at https://github.com/PhoenixZ810/OmniAlign-V. The paper details the data synthesis pipeline including specific tools (IC9600, RAM) and prompt strategies (seed questions, few-shot selection).

📊 Experiments & Results

Evaluation Setup

Evaluation of human preference alignment and foundational capabilities using SFT and DPO trained models

Benchmarks:

MM-AlignBench (human preference alignment, open-ended VQA) [New]
WildVision (human preference alignment, in-the-wild images)
MMMU (Foundational Multi-discipline Reasoning)
OCRBench (Optical Character Recognition)
MathVista (Visual Math Reasoning)

Metrics:

Win Rate vs GPT-4o
Score (WildVision specific metric)
Accuracy/F1 (Foundational benchmarks)
Statistical methodology: Not explicitly reported in the paper
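A win rate against a reference model is typically computed from pairwise judgments, as in the sketch below. The five-way label set is an assumption made for illustration (the summary only names the metric), and the judgments are made-up toy data rather than results from the paper.

```python
from collections import Counter
from typing import Iterable

# Hypothetical label set: the judge compares the model's answer with the reference answer.
WIN_LABELS = {"much_better", "better"}


def win_rate(judgments: Iterable[str]) -> float:
    """Percentage of comparisons in which the model was judged better than the reference."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    wins = sum(counts[label] for label in WIN_LABELS)
    return 100.0 * wins / len(judgments)


toy_judgments = ["better", "worse", "tie", "much_worse",
                 "much_better", "worse", "worse", "tie"]
print(win_rate(toy_judgments))  # 25.0
```

Ties count against the model under this definition, so a win rate below 50% does not by itself mean the model lost most comparisons.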

Key Results

SFT with OmniAlign-V significantly improves alignment metrics compared to the LLaVA-Next-778K baseline while preserving or improving foundational skills.

Benchmark	Metric	Baseline	This Paper	Δ
WildVision	Score	32.5	46.1	+13.6
MM-AlignBench	Win Rate	8.7	23.4	+14.7
MMMU	Accuracy	37.2	38.8	+1.6
OCRBench	Score	787	795	+8

DPO training provides additional alignment gains over SFT alone (here the Baseline column is the SFT result).

Benchmark	Metric	Baseline	This Paper	Δ
MM-AlignBench	Win Rate	23.4	25.4	+2.0
WildVision	Score	46.1	48.2	+2.1
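As a quick sanity check, the reported deltas can be recomputed from the baseline and result columns. The numbers below are copied from the table; the stage labels and variable names are only for this illustration.

```python
# (stage, benchmark, baseline, this_paper, reported_delta), copied from the table above
rows = [
    ("SFT", "WildVision", 32.5, 46.1, 13.6),
    ("SFT", "MM-AlignBench", 8.7, 23.4, 14.7),
    ("SFT", "MMMU", 37.2, 38.8, 1.6),
    ("SFT", "OCRBench", 787.0, 795.0, 8.0),
    ("DPO", "MM-AlignBench", 23.4, 25.4, 2.0),
    ("DPO", "WildVision", 46.1, 48.2, 2.1),
]

# Round to one decimal to absorb floating-point noise in the subtraction.
recomputed = [round(ours - base, 1) for _, _, base, ours, _ in rows]
mismatches = [row[:2] for row, delta in zip(rows, recomputed) if delta != row[4]]

print(recomputed)   # [13.6, 14.7, 1.6, 8.0, 2.0, 2.1]
print(mismatches)   # []
```

Note that all deltas are absolute point differences, not relative percentages.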

Main Takeaways

Mixing text-only alignment data into MLLM training is insufficient and can harm multi-modal capabilities; specialized multi-modal data is required.
OmniAlign-V significantly boosts human preference alignment (win-rates vs GPT-4o) without suffering from catastrophic forgetting on standard benchmarks like MMMU.
A rigorous image filtering pipeline (IC9600 + RAM) is crucial for selecting semantically rich images that support complex, open-ended questions.
DPO with synthetic preference pairs generated via rejection sampling further enhances alignment over SFT alone.

📚 Prerequisite Knowledge

Prerequisites

Visual Instruction Tuning
Supervised Fine-Tuning (SFT)
Direct Preference Optimization (DPO)
Transformer-based LLM architectures

Key Terms

MLLM: Multi-Modal Large Language Model, an AI system that takes both images and text as input and generates text responses

SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs to teach it how to follow instructions

DPO: Direct Preference Optimization—an alignment algorithm that optimizes a model to prefer 'winning' responses over 'losing' ones without needing a separate reward model

IC9600: An Image Complexity assessment model used to score images based on visual clutter and detail

RAM: Recognize Anything Model—a computer vision model used to tag and identify objects within an image

MMMU: A massive multi-discipline multi-modal understanding benchmark requiring expert-level knowledge

WildVision: A benchmark for evaluating MLLMs on diverse, wild (real-world) vision-language tasks

LLaVA-Next: An improved version of LLaVA (Large Language and Vision Assistant) that supports higher input image resolutions and shows stronger reasoning and OCR

OCR: Optical Character Recognition—converting text within images into machine-readable text