MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

📝 Paper Summary

Multimodal Large Language Models (MLLMs)
Visual Instruction Tuning

MM1.5 is a family of multimodal models (1B to 30B) that achieves strong performance in text-rich and fine-grained visual tasks through a meticulous three-stage data mixing strategy.

Core Problem

Developing performant MLLMs is highly empirical, and the precise impact of data mixtures across different training stages (especially for diverse capabilities like OCR, grounding, and multi-image reasoning) remains under-explored.

Why it matters:

Current open-source models often lack robust visual referring and grounding capabilities compared to proprietary models like GPT-4o
The specific trade-offs between different data categories (e.g., how adding multi-image data affects single-image performance) are not well-documented
Efficient scaling of MLLMs to small sizes (1B-3B) for mobile deployment while maintaining high performance is a critical challenge

Concrete Example: A user asks 'What can I make with these ingredients?' pointing to specific items in an image. Standard models might list generic recipes, but MM1.5 can identify the specific ingredients via coordinates, ground its response with bounding boxes, and reason about the combined items.

Key Novelty

Comprehensive Data-Centric MLLM Training Recipe (MM1.5)

Introduces a high-resolution 'continual pre-training' stage with OCR data, bridging the gap between coarse pre-training and fine-grained SFT
Optimizes SFT data mixtures by explicitly balancing competing capabilities (general, text-rich, knowledge, grounding) through extensive ablation studies rather than random mixing
Implements 'dynamic image splitting' (AnyRes) to handle arbitrary aspect ratios and high resolutions (up to 4 Megapixels), maximizing OCR and detail retention

Architecture

Overview of the MM1.5 architecture highlighting three key capabilities: single-image understanding with dynamic splitting, multi-image reasoning, and visual referring/grounding.

Evaluation Highlights

MM1.5-3B-Chat achieves an MMBase score of 62.6, outperforming the larger open-source LLaVA-NeXT-8B (60.6)
MM1.5-30B-Chat achieves 86.6 on MMBench and 65.2 on MMMU, competitive with GPT-4o (69.1 on MMMU) and Gemini 1.5 Pro
On text-rich benchmarks (DocVQA), MM1.5-30B reaches 91.0, surpassing GPT-4V (88.4) and approaching GPT-4o (92.8)

Breakthrough Assessment

8/10

While the architecture is standard, the rigorous empirical study of data mixtures and the resulting high performance at small scales (1B/3B) provide a highly valuable recipe for the community.

⚙️ Technical Details

Problem Definition

Setting: Multimodal generative modeling where the model accepts interleaved text and images (single or multiple) and generates text, optionally with bounding box coordinates

Inputs: Sequence of text tokens and image inputs (potentially high-resolution, variable aspect ratio)

Outputs: Text response, potentially containing bounding box coordinates <x1, y1, x2, y2> for grounding
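A minimal sketch of how a downstream consumer might pull grounded boxes out of a text response. It assumes the coordinates are serialized exactly as `<x1, y1, x2, y2>` with non-negative integers, which is only the notation used in this summary; the model's real output format may differ. The function name `extract_boxes` is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed serialization: "<x1, y1, x2, y2>" with integer pixel coordinates.
BOX_PATTERN = re.compile(r"<\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*>")


def extract_boxes(response: str) -> List[Tuple[int, int, int, int]]:
    """Return every well-formed (x1, y1, x2, y2) box found in the response."""
    boxes = []
    for match in BOX_PATTERN.finditer(response):
        x1, y1, x2, y2 = (int(group) for group in match.groups())
        # Keep only boxes with positive width and height.
        if x1 < x2 and y1 < y2:
            boxes.append((x1, y1, x2, y2))
    return boxes


if __name__ == "__main__":
    text = "The tomato <10, 20, 110, 220> and the basil <300,40,380,120>."
    print(extract_boxes(text))
```

Degenerate boxes (zero or negative area) are dropped rather than repaired, so malformed generations are never silently turned into valid-looking regions.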

Pipeline Flow

Input Processing (Dynamic Image Splitting + Text Tokenization)
Visual Encoding (CLIP + C-Abstractor)
LLM Processing (Decoder-only Transformer)
Output Generation (Text + Coordinates)

System Modules

Dynamic Image Splitter

Divides high-res images into a grid of sub-images (plus a global overview) to preserve details

Model or implementation: Algorithm based on aspect ratio
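The grid choice can be illustrated with a small sketch. This is not the paper's exact selection rule: it assumes a 336px tile (matching the encoder resolution listed for the vision encoder), a cap `n_max` on the number of sub-images, and a hypothetical cost that first minimizes detail lost to downscaling, then padding, then tile count. The function names are illustrative only.

```python
from typing import List, Tuple


def candidate_grids(n_max: int) -> List[Tuple[int, int]]:
    """All (rows, cols) grids that use at most n_max sub-images."""
    return [(r, c) for r in range(1, n_max + 1)
            for c in range(1, n_max + 1) if r * c <= n_max]


def choose_grid(width: int, height: int, tile: int = 336,
                n_max: int = 9) -> Tuple[int, int]:
    """Pick the grid that preserves the most detail with the least padding."""
    def cost(grid: Tuple[int, int]):
        rows, cols = grid
        canvas_w, canvas_h = cols * tile, rows * tile
        # Fit the image inside the canvas; never upscale.
        scale = min(canvas_w / width, canvas_h / height, 1.0)
        lost_detail = 1.0 - scale * scale
        covered = (width * scale) * (height * scale)
        padding = 1.0 - covered / (canvas_w * canvas_h)
        return (lost_detail, padding, rows * cols)
    return min(candidate_grids(n_max), key=cost)


def tile_boxes(rows: int, cols: int, tile: int = 336):
    """Crop boxes (x1, y1, x2, y2) of each sub-image on the resized canvas."""
    return [(c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
            for r in range(rows) for c in range(cols)]


if __name__ == "__main__":
    rows, cols = choose_grid(672, 336)
    print((rows, cols), tile_boxes(rows, cols))
```

The global overview would be one additional tile made by resizing the whole image to `tile` x `tile`; it is left out of `tile_boxes` to keep the sketch short.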

Vision Encoder (Visual Encoding)

Encodes image patches into visual embeddings

Model or implementation: CLIP-L/14 (336px) equivalent (in-house)

VL Connector (Visual Encoding)

Compresses and projects visual features into the LLM's token space

Model or implementation: C-Abstractor

LLM Backbone

Generates text and coordinates autoregressively

Model or implementation: Dense (1B, 3B, 7B, 30B) or MoE (1B, 3B with 64 experts)

Novel Architectural Elements

Integration of dynamic image splitting with C-Abstractor connector for flexible high-resolution encoding within the MM1 architecture

Modeling

Base Model: MM1 (Dense: 1B, 3B, 7B, 30B; MoE: 1B, 3B with 64 experts)

Training Method: Multi-stage training: Pre-training → Continual Pre-training → Supervised Fine-tuning (SFT)

Training Data:

Pre-training: 2B image-text pairs, 600M interleaved docs, text-only data (ratio 50:10:40)
Continual PT: 45M high-res OCR data (PDFA, IDL, Rendered-text, DocStruct-4M)
SFT: Mixture of ~3M examples across General, Text-Rich, Knowledge, Referring, Multi-image, Text-only
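To make the 50:10:40 pre-training ratio concrete, the sketch below splits one batch across the three sources using largest-remainder rounding. Whether the ratio is enforced per batch or only in expectation is not stated here, so per-batch allocation is an assumption, as are the function name `allocate_batch`, the source labels, and the batch size of 256, which is borrowed from the hyperparameter list purely for illustration.

```python
from typing import Dict


def allocate_batch(batch_size: int, ratios: Dict[str, float]) -> Dict[str, int]:
    """Split batch_size across sources in proportion to ratios."""
    total = sum(ratios.values())
    exact = {name: batch_size * r / total for name, r in ratios.items()}
    counts = {name: int(value) for name, value in exact.items()}
    # Hand leftover slots to the sources with the largest fractional parts.
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    mix = {"image_text": 50, "interleaved": 10, "text_only": 40}
    print(allocate_batch(256, mix))
```

Largest-remainder rounding keeps the counts summing exactly to the batch size, which plain rounding does not guarantee.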

Key Hyperparameters:

batch_size: 256
learning_rate: 1e-5 (peak)
scheduler: cosine decay
optimizer: AdaFactor
image_resolution_continual_pt: 1344x1344
sft_epochs: 1
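The learning-rate entries above can be read as a standard cosine decay from the 1e-5 peak. The sketch below is a generic formulation, not the authors' training code; the optional linear warmup, the `min_lr` floor, and the total step count are hypothetical parameters that the summary does not report.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 1e-5,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Learning rate at `step` under optional linear warmup plus cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 250, 500, 750, 1000):
        print(s, cosine_lr(s, total_steps=1000))
```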

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaVA-OneVision: MM1.5 includes native visual referring and grounding capabilities without set-of-mark prompting
vs. Cambrian-1: MM1.5 emphasizes a data-centric recipe for scaling down to 1B/3B parameters while maintaining high performance
vs. GPT-4o: MM1.5 is an open-methodology family (though weights/data are not fully open) offering specialized UI and Video variants

Limitations

Dynamic image splitting increases computational cost due to more tokens per image
Synthetic captions for continual pre-training showed mixed results and were not included in final recipe
Performance on multi-image tasks trades off slightly with single-image capabilities depending on data mixture
Pre-training data (2B images) is not public, limiting full reproducibility
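The first limitation is easy to quantify with a back-of-the-envelope calculation. The sketch below counts visual tokens for a given split grid, assuming every sub-image (and the global overview) is compressed to the same fixed number of tokens by the connector. The default of 144 tokens per tile is a placeholder chosen for illustration, not a value taken from this summary.

```python
def visual_tokens(rows: int, cols: int, tokens_per_tile: int = 144,
                  with_overview: bool = True) -> int:
    """Visual tokens fed to the LLM for a rows x cols split of one image."""
    tiles = rows * cols + (1 if with_overview else 0)
    return tiles * tokens_per_tile


if __name__ == "__main__":
    baseline = visual_tokens(1, 1, with_overview=False)
    for grid in [(1, 1), (2, 2), (3, 3)]:
        tokens = visual_tokens(*grid)
        print(grid, tokens, f"{tokens / baseline:.0f}x the unsplit cost")
```

Under these assumptions, a 2x2 split with an overview costs five times the tokens of a single unsplit image, and the LLM's attention cost grows with that longer sequence.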

Reproducibility

SFT data mixtures and ratios are detailed in the paper. Pre-training datasets are proprietary/internal (e.g., the 2B image-text pairs). The paper text does not explicitly state that code or model weights are publicly available.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on a diverse set of multimodal benchmarks grouped by capability.

Benchmarks:

MMBench (General Multimodal QA)
DocVQA (Document Visual QA)
MMMU (Multidisciplinary Knowledge/Reasoning)
RefCOCO (Visual Grounding)
MathVista (Visual Math Reasoning)

Metrics:

Accuracy
MMBase Score (Average of General, Text-Rich, Knowledge scores)
Statistical methodology: Not explicitly reported in the paper
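The MMBase score can be sketched as a mean over the three category scores. Treating each category score as the plain average of its benchmarks is an assumption here, and the benchmark values in the usage example are made-up numbers for illustration, not results from the paper.

```python
from statistics import mean
from typing import Dict, List

REQUIRED_CATEGORIES = ("general", "text_rich", "knowledge")


def mmbase_score(category_scores: Dict[str, List[float]]) -> float:
    """Average the per-category means of General, Text-Rich, and Knowledge."""
    return mean(mean(category_scores[name]) for name in REQUIRED_CATEGORIES)


if __name__ == "__main__":
    # Hypothetical benchmark scores, grouped by category.
    scores = {
        "general": [60.0, 70.0],
        "text_rich": [80.0],
        "knowledge": [55.0, 45.0],
    }
    print(mmbase_score(scores))
```

Averaging per category first keeps a category with many benchmarks from dominating the aggregate.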

Key Results

Comparison of MM1.5-30B-Chat against state-of-the-art proprietary and open-source models demonstrates competitive performance.
Benchmark	Metric	Baseline	This Paper	Δ
MMBench	Accuracy	83.4	86.6	+3.2
DocVQA	Accuracy	92.8	91.0	-1.8
MMMU	Accuracy	69.1	65.2	-3.9

Evaluation of small-scale models shows MM1.5-1B/3B outperforming similar-sized competitors.
Benchmark	Metric	Baseline	This Paper	Δ
MMBench	Accuracy	70.0	73.2	+3.2

Ablation study on Continual Pre-training resolution shows that higher resolution improves the MMBase score.
Benchmark	Metric	Baseline	This Paper	Δ
MMBase Score	Score	58.28	60.26	+1.98

Experiment Figures

Bar chart showing the impact of adding different SFT data categories (Math, Science, Code, Grounding) to the 'General' baseline.

Ablation of Continual Pre-training (CPT) resolution and data sources.

Main Takeaways

High-resolution continual pre-training with OCR data is essential for boosting text-rich image understanding without hurting general capabilities.
Data mixing for SFT is sensitive; a ratio of 2.0 for Referring & Grounding data (relative to General) is optimal for enabling grounding without degrading other tasks.
Dynamic image splitting (AnyRes) consistently outperforms static splitting, especially for document tasks (DocVQA), with larger n_max yielding better results.
Small models (1B/3B) can achieve SOTA performance for their size class through careful data curation, outperforming larger but less carefully optimized models.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture and Vision Transformers (ViT)
Visual Instruction Tuning (SFT)
Mixture-of-Experts (MoE)
OCR (Optical Character Recognition)

Key Terms

OCR: Optical Character Recognition, a technology for extracting text from images

SFT: Supervised Fine-Tuning, i.e., training a pre-trained model on labeled instruction-following data

MoE: Mixture-of-Experts, a model architecture where different sub-models (experts) are activated for different inputs, allowing high capacity with lower inference cost

Continual Pre-training: An intermediate training stage between large-scale pre-training and SFT, often used to inject specific domains or capabilities (like high-res OCR here)

Dynamic Image Splitting: A technique (also known as AnyRes) where an image is divided into a variable grid of sub-images based on its aspect ratio to preserve resolution

C-Abstractor: A vision-language connector module that compresses visual features into a fixed number of tokens for the LLM

MMBase score: An aggregate metric defined in this paper averaging performance across General, Text-Rich, and Knowledge benchmark categories