
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

Jihao Liu, Xin Huang, Jinliang Zheng, Boxiao Liu, Jia Wang, Osamu Yoshie, Yu Liu, Hongsheng Li
CUHK MMLab, SenseTime Research, Shanghai AI Laboratory, Institute for AI Industry Research (AIR), Tsinghua University, Waseda University
arXiv (2024)
MM Benchmark

📝 Paper Summary

Visual Instruction Tuning · Synthetic Data Generation
MM-Instruct leverages LLMs to generate diverse visual instructions and coherent answers from image captioning datasets, significantly improving LMM performance on creative and complex tasks beyond standard QA.
Core Problem
Existing visual instruction datasets focus heavily on basic question-answering or captioning, causing Large Multimodal Models (LMMs) to fail at real-world tasks requiring creativity, summarization, or complex analysis.
Why it matters:
  • Current models score well on benchmarks but fail on real-world user requests (e.g., 'write a poem about this image' or 'summarize the event')
  • Manually collecting diverse, high-quality instruction data is prohibitively expensive and hard to scale for academic groups
  • Standard image captioning datasets lack the textual diversity needed to train robust instruction-following capabilities
Concrete Example: When asked to 'Write a news report about the event in the image,' LLaVA-1.5 often fails to adopt the requested format or style, merely describing the image content factually. The proposed LLaVA-Instruct generates a structured news report with a headline and narrative style.
Key Novelty
LLM-driven augmentation of caption datasets into complex instructions
  • Uses a text-only LLM (ChatGPT) to brainstorm diverse instructions based on detailed image descriptions, rather than relying on human annotation or simple templates
  • Employs a retrieval-based pipeline where images are matched to these synthetic instructions via CLIP, ensuring visual relevance without needing paired data initially
  • Generates answers using a strong LLM grounded in detailed textual descriptions of the image, ensuring the reasoning trace aligns with visual content (see the sketch after this list)
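
The snippet below is a minimal sketch of the two text-only LLM calls implied by this list: brainstorming instructions from a detailed caption, then generating an answer grounded in that caption. The model name, prompt wording, and helper functions are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch only: assumes a ChatGPT-style text-only LLM via the OpenAI SDK.
# Prompts, model choice, and function names are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def brainstorm_instructions(caption: str, n: int = 5) -> list[str]:
    """Ask a text-only LLM to invent diverse instructions for an image,
    using only its detailed caption as grounding."""
    prompt = (
        "Here is a detailed description of an image:\n"
        f"{caption}\n\n"
        f"Propose {n} diverse instructions a user might give about this image "
        "(e.g., write a poem, draft a news report, summarize the event). "
        "Return one instruction per line."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]


def generate_grounded_answer(caption: str, instruction: str) -> str:
    """Generate an answer whose reasoning is grounded in the textual
    description rather than the raw pixels."""
    prompt = (
        f"Image description:\n{caption}\n\n"
        f"Instruction: {instruction}\n"
        "Answer the instruction using only facts supported by the description."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```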
Architecture
Figure 2: The automated pipeline for constructing the MM-Instruct dataset, divided into Instruction Generation (top) and Instance Generation (bottom).
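
For the Instance Generation stage, a rough sketch of the CLIP-based matching step is shown below, under the assumption that images and candidate instructions are paired by embedding both with an off-the-shelf CLIP model and keeping the highest-similarity instructions. The checkpoint name and top-k value are illustrative choices, not the paper's exact settings.

```python
# Sketch of CLIP-based instruction-image matching (not the authors' code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


def match_instructions(image_path: str, instructions: list[str], top_k: int = 3) -> list[str]:
    """Rank candidate instructions by CLIP image-text similarity."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=instructions, images=image,
        return_tensors="pt", padding=True, truncation=True,
    )
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image holds the similarity of the single image to each instruction
    scores = out.logits_per_image.squeeze(0)
    best = scores.topk(min(top_k, len(instructions))).indices.tolist()
    return [instructions[i] for i in best]
```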
Evaluation Highlights
  • LLaVA-Instruct-13B outperforms LLaVA-1.5-13B by wide margins on the VizWiz (+219.78 points) and MME (+101.46 points) benchmarks despite training on generated data
  • With GPT-4V as judge, LLaVA-Instruct-7B's responses are rated equal to or better than those of the base LLaVA-1.5-7B model in 72% of cases
  • Outperforms LLaVA-1.5 on 9 out of 12 evaluated vision-language benchmarks, showing that diverse instruction tuning improves general perception
Breakthrough Assessment
7/10
Strong pragmatic contribution demonstrating that synthetic instruction data from text-only LLMs can significantly boost LMM alignment and general performance, though the core architecture remains standard LLaVA.