MMInstruct: a high-quality multi-modal instruction tuning dataset with extensive diversity

📝 Paper Summary

Visual Instruction Tuning, Multi-Modal Dataset Construction

MMInstruct improves Vision Large Language Models by fine-tuning them on a high-quality, diverse dataset generated via a semi-automatic engine that leverages GPT-4V for detailed image captioning and GPT-3.5 for instruction synthesis.

Core Problem

Existing visual instruction tuning datasets suffer from limited image diversity (often restricted to COCO), poor annotation quality causing hallucinations, and a narrow range of instruction types.

Why it matters:

Models trained on limited scenes (e.g., COCO) struggle to generalize to real-world scenarios like text-oriented OCR images.
Data generation pipelines relying on rudimentary annotations or weak seed questions introduce noise and hallucinations into VLLMs.
Manual construction of diverse, high-quality datasets is prohibitively expensive at large scale.

Concrete Example: Models trained on standard datasets struggle to process text-oriented OCR images because the underlying training images lack text diversity. Furthermore, instructions generated from simple bounding box annotations often hallucinate details not present in the image.

Key Novelty

Semi-Automatic Instruction Generation Data Engine

Replaces rudimentary image annotations with detailed, domain-specific semantic captions generated by GPT-4V to ground instruction generation.
Utilizes a 'seed question' strategy where experts design domain-specific templates that serve as references, encouraging GPT-3.5 to generate diverse instruction-answer pairs.
Combines automated generation with manual correction to ensure quality while reducing costs to 1/6th of fully manual annotation.
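The three steps above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the function names, the `EngineOutput` type, the "poster" domain, and the stub callables are all hypothetical, and the stubs stand in for the GPT-4V captioning call, the GPT-3.5 generation call, and the human review step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

QAPair = Tuple[str, str]  # (instruction, answer)


@dataclass
class EngineOutput:
    image_id: str
    caption: str
    qa_pairs: List[QAPair]


def run_data_engine(
    image_id: str,
    domain: str,
    seed_questions: List[str],
    caption_fn: Callable[[str, str], str],
    generate_fn: Callable[[str, List[str]], List[QAPair]],
    review_fn: Callable[[QAPair], Optional[QAPair]],
) -> EngineOutput:
    """Caption -> seed-guided generation -> manual correction (hypothetical sketch)."""
    # Step 1: a detailed, domain-specific caption grounds everything downstream.
    caption = caption_fn(image_id, domain)
    # Step 2: expert seed questions act as references for instruction synthesis.
    drafts = generate_fn(caption, seed_questions)
    # Step 3: a reviewer may fix a pair (return it) or reject it (return None).
    reviewed = [review_fn(pair) for pair in drafts]
    kept = [pair for pair in reviewed if pair is not None]
    return EngineOutput(image_id=image_id, caption=caption, qa_pairs=kept)


# Stub callables (assumptions) so the sketch runs without any API access.
def fake_caption(image_id: str, domain: str) -> str:
    return f"[{domain}] detailed caption for {image_id}"


def fake_generate(caption: str, seeds: List[str]) -> List[QAPair]:
    pairs = [(q, f"answer grounded in: {caption}") for q in seeds]
    return pairs + [("", "pair with an empty instruction")]


def fake_review(pair: QAPair) -> Optional[QAPair]:
    instruction, _answer = pair
    return pair if instruction.strip() else None


result = run_data_engine(
    "img_001", "poster", ["What is the title on the poster?"],
    fake_caption, fake_generate, fake_review,
)
```

Passing the three stages in as callables keeps the sketch independent of any particular API client; in practice each stub would be replaced by a real model call or an annotation tool.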

Evaluation Highlights

LLaVA-1.5 fine-tuned on MMInstruct achieves a score of 1626.2 on the MME benchmark, surpassing the baseline LLaVA-1.5 by 94.9 points.
On LLaVA-Bench (In-the-Wild), the model scores 74.5, outperforming the LLaVA-1.5 baseline by 3.8 points.
Achieves state-of-the-art performance on 10 out of 12 evaluated benchmarks compared to LLaVA-1.5 trained on standard datasets.

Breakthrough Assessment

8/10

Significant contribution to data engineering for VLLMs. The cost-effective pipeline addresses key bottlenecks (diversity/hallucination) and yields SOTA results on major benchmarks, though the underlying model architecture remains standard.

⚙️ Technical Details

Problem Definition

Setting: Visual Instruction Tuning (Supervised Fine-Tuning for Vision-Language Models)

Inputs: Image I and natural language instruction T

Outputs: Natural language response R

Pipeline Flow

Vision Encoder (CLIP) processes Image
Projector maps visual features to language space
LLM (Vicuna) processes mapped features + Text Instruction
Generation of Response

System Modules

Vision Encoder (Input Processing)

Extract visual features from the input image

Model or implementation: CLIP-ViT-L-336px (Frozen)

Projector (Input Processing)

Align visual features with the LLM's embedding space

Model or implementation: Two-layer MLP

Large Language Model

Generate text response conditioned on visual tokens and text instruction

Model or implementation: Vicuna-v1.5 (7B/13B)
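To make the data flow through these modules concrete, the following sketch does shape bookkeeping only. It rests on assumptions not stated in this summary: a patch size of 14 (the common ViT-L/14 variant), a CLIP hidden width of 1024, a Vicuna-7B hidden size of 4096, and a projector whose hidden width equals the LLM width. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlavaShapes:
    image_size: int = 336   # from "CLIP-ViT-L-336px"
    patch_size: int = 14    # assumption: ViT-L/14 variant
    vision_dim: int = 1024  # assumption: CLIP ViT-L hidden width
    llm_dim: int = 4096     # assumption: Vicuna-7B hidden size

    @property
    def num_visual_tokens(self) -> int:
        # One token per non-overlapping image patch.
        side = self.image_size // self.patch_size
        return side * side


def sequence_length(shapes: LlavaShapes, num_text_tokens: int) -> int:
    # The LLM sees projected visual tokens concatenated with instruction tokens.
    return shapes.num_visual_tokens + num_text_tokens


def projector_param_count(shapes: LlavaShapes) -> int:
    # Two-layer MLP with biases: vision_dim -> llm_dim -> llm_dim
    # (the hidden width is an assumption).
    first = shapes.vision_dim * shapes.llm_dim + shapes.llm_dim
    second = shapes.llm_dim * shapes.llm_dim + shapes.llm_dim
    return first + second


shapes = LlavaShapes()
visual_tokens = shapes.num_visual_tokens
total_tokens = sequence_length(shapes, num_text_tokens=32)
projector_params = projector_param_count(shapes)
```

Under these assumptions each image contributes 576 visual tokens, so the image dominates the sequence length for short instructions, while the projector remains tiny relative to the LLM.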

Modeling

Base Model: LLaVA-1.5 (Vicuna-v1.5 + CLIP-ViT-L-336px)

Training Method: Visual Instruction Tuning (Supervised Fine-Tuning)

Objective Functions:

Purpose: Predict the next text token conditioned on image and instruction.

Formally: Autoregressive language modeling loss.
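One common way to write this objective is shown below. It assumes the loss is summed only over response tokens, which is the usual convention for LLaVA-style training but is not stated explicitly in this summary. The symbols r_t (the t-th token of R) and θ (the trainable LLM and projector parameters) are introduced here for illustration.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|R|} \log p_\theta\left(r_t \mid I, T, r_{<t}\right)
```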

Adaptation: Full fine-tuning of the LLM and Projector (standard LLaVA protocol)

Trainable Parameters: LLM weights and MLP projector weights

Training Data:

161K Images collected via web crawling and similarity search
973K Instructions generated via GPT-3.5 based on GPT-4V captions
24 Domains (Perception, Reasoning, VQA)

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
epochs: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaVA-1.5: MMInstruct uses GPT-4V generated captions instead of existing COCO annotations, reducing hallucination.
vs. LRV-Instruction: MMInstruct covers a broader range of 24 domains including OCR and complex reasoning, not just hallucination mitigation.
vs. M3IT: MMInstruct emphasizes instruction diversity via 'seed questions' rather than just scaling up existing academic datasets.

Limitations

The paper does not explicitly report training hyperparameters (learning rate, batch size, epochs) for the fine-tuning experiments.
Reliance on commercial APIs (GPT-4V, GPT-3.5) creates a dependency on closed-source models for data generation.
Manual correction is still required (though reduced), so human effort may still scale linearly with dataset size.

Reproducibility

Code: https://github.com/yuecao0119/MMInstruct

Code and data are publicly available at https://github.com/yuecao0119/MMInstruct. The paper reports the data engine costs in detail ($0.00885 per image caption, $0.0004 per instruction generation) but omits the training hyperparameters (learning rate, batch size, epochs) for the LLaVA fine-tuning experiments.
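The per-unit costs quoted above allow a rough back-of-envelope estimate of API spend. The sketch below is an illustration, not a figure reported by the paper: it assumes exactly one caption per image and one charge per instruction, uses the rounded counts 161K and 973K from the training data description, and ignores the cost of manual correction.

```python
from decimal import Decimal

CAPTION_COST = Decimal("0.00885")     # USD per image caption (from the text)
INSTRUCTION_COST = Decimal("0.0004")  # USD per instruction generation (from the text)
NUM_IMAGES = 161_000                  # rounded "161K Images"
NUM_INSTRUCTIONS = 973_000            # rounded "973K Instructions"


def engine_api_cost(num_images: int, num_instructions: int):
    """Return (caption_total, instruction_total, grand_total) in USD."""
    # Decimal avoids binary floating-point drift in currency arithmetic.
    caption_total = CAPTION_COST * num_images
    instruction_total = INSTRUCTION_COST * num_instructions
    return caption_total, instruction_total, caption_total + instruction_total


caption_usd, instruction_usd, total_usd = engine_api_cost(NUM_IMAGES, NUM_INSTRUCTIONS)
print(f"captions: ${caption_usd:.2f}, instructions: ${instruction_usd:.2f}, total: ${total_usd:.2f}")
```

Under these assumptions the captioning step accounts for most of the API spend, even though instructions outnumber images roughly six to one.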

📊 Experiments & Results

Evaluation Setup

Visual Instruction Tuning of LLaVA-1.5 using MMInstruct, followed by evaluation on standard VLLM benchmarks.

Benchmarks:

MME (Comprehensive multimodal evaluation)
LLaVA-Bench (In-the-Wild) (Real-world image understanding)
MMBench (Multimodal perception and reasoning)
MM-Vet (Integrated multimodal capabilities)

Metrics:

Score (Benchmark specific)
Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Fine-tuning on MMInstruct consistently improves performance over the LLaVA-1.5 baseline across a wide variety of benchmarks.

Benchmark	Metric	Baseline (LLaVA-1.5)	This Paper	Δ
MME	Total Score	1531.3	1626.2	+94.9
LLaVA-Bench (In-the-Wild)	Score	70.7	74.5	+3.8
MMBench	Accuracy	67.7	68.2	+0.5
MM-Vet	Score	35.4	36.2	+0.8
POPE	Accuracy	85.9	87.2	+1.3
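The Δ column can be checked directly against the baseline and reported scores. The short sketch below recomputes each difference from the table values; the helper names are hypothetical, and rounding to one decimal place is an assumption matching the precision shown in the table.

```python
# (benchmark, baseline, this_paper, reported_delta) copied from the table above.
ROWS = [
    ("MME", 1531.3, 1626.2, 94.9),
    ("LLaVA-Bench (In-the-Wild)", 70.7, 74.5, 3.8),
    ("MMBench", 67.7, 68.2, 0.5),
    ("MM-Vet", 35.4, 36.2, 0.8),
    ("POPE", 85.9, 87.2, 1.3),
]


def recompute_deltas(rows):
    # Round to one decimal to absorb floating-point noise in the subtraction.
    return {name: round(ours - base, 1) for name, base, ours, _ in rows}


def mismatches(rows):
    # Names of benchmarks whose reported delta disagrees with the recomputed one.
    return [
        name
        for name, base, ours, reported in rows
        if round(ours - base, 1) != reported
    ]


deltas = recompute_deltas(ROWS)
inconsistent = mismatches(ROWS)
```

An empty `inconsistent` list means every reported Δ agrees with the two score columns at the table's precision.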

Experiment Figures

Radar chart comparing LLaVA-1.5 (Baseline) vs. LLaVA-1.5 (MMInstruct) across multiple benchmarks.

Main Takeaways

Fine-tuning on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks compared to the strong LLaVA-1.5 baseline.
The dataset construction method (Data Engine) reduces costs to 1/6th of manual annotation while maintaining high quality.
Using GPT-4V for detailed captioning helps generate more diverse instructions and reduces hallucinations compared with using raw image annotations.
The inclusion of 24 diverse domains, including OCR and reasoning, improves the model's ability to generalize to complex, real-world tasks.

📚 Prerequisite Knowledge

Prerequisites

Vision Large Language Models (VLLMs)
Instruction Tuning
Contrastive Language-Image Pre-training (CLIP)

Key Terms

VLLM: Vision Large Language Model—AI models that can see images and understand text instructions to generate responses

OCR: Optical Character Recognition—the conversion of images of typed, handwritten, or printed text into machine-encoded text

Hallucination: A phenomenon where a model generates plausible-sounding but incorrect or factually baseless information

SFT: Supervised Fine-Tuning—training a model on labeled examples to improve its performance on specific tasks

Seed Question: A representative question template designed by experts to guide the automated generation of diverse instructions

MME: A comprehensive evaluation benchmark for Multimodal Large Language Models

LLaVA-Bench: A benchmark assessing VLLM performance on challenging, real-world images (In-the-Wild)

GPT-4V: GPT-4 with Vision—a multimodal model capable of analyzing image inputs