MIMIC-IT: Multi-Modal In-Context Instruction Tuning

📝 Paper Summary

Vision-Language Models (VLMs) Instruction Tuning In-Context Learning

MIMIC-IT is a large-scale dataset of 2.8 million multi-modal instruction-response pairs designed to train vision-language models to perceive, reason, and plan using images and videos as context.

Core Problem

Existing vision-language instruction datasets are limited in visual diversity (mostly single images like COCO), lack video support, and rely solely on language for in-context information rather than multi-modal context.

Why it matters:

Current assistants fail when users provide multiple images or videos as context (e.g., 'compare these two photos')
Zero-shot generalization requires diverse, high-quality instructions that mirror real-world visual complexity beyond simple object recognition
Models need to understand context (user intent, tone, style) through visual examples, not just text descriptions

Concrete Example: In LLaVA-Instruct, a model only sees one image and text examples. In MIMIC-IT, a user can upload two images and ask 'What is the difference between these two images?' or provide a video clip and ask 'Is it safe to walk on the floor while the woman is cleaning?' requiring temporal and comparative reasoning.

Key Novelty

Multi-Modal In-Context Instruction Tuning (MIMIC-IT)

Introduces multi-modal in-context examples: instead of just text Q&A examples, the model receives context consisting of images/videos + text pairs to learn the task pattern
Syphus: An automated pipeline using ChatGPT to generate instruction-response pairs based on visual annotations (bounding boxes, captions) and system messages defining tone/style
Supports arbitrary visual inputs (multiple images or video clips) within a single instruction cycle, enabling tasks like 'spot the difference' or egocentric video planning

Architecture

The Syphus automated instruction generation pipeline

Evaluation Highlights

Otter model achieves highest Elo rating (1014.7) on Multi-Modality Arena, outperforming LLaVA and OpenFlamingo in human evaluation
+6.8% accuracy improvement over VideoChatGPT on MSVD zero-shot video question answering
Superior few-shot learning: Otter outperforms OpenFlamingo by ~14 CIDEr points on COCO captions in the 4-shot setting

Breakthrough Assessment

9/10

Significant scale-up (2.8M pairs) and structural innovation (multi-modal in-context). Addresses key gaps in video/multi-image understanding. Performance gains are substantial across diverse benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal instruction tuning with in-context examples

Inputs: A query tuple (Iq, Xq) consisting of an instruction and visual data (images/video), accompanied by in-context examples (Ik, Rk, Xk)

Outputs: A generated response Rq

Pipeline Flow

Syphus Pipeline (Data Generation): Visual Annotations + System Message -> ChatGPT -> Instruction-Response Pairs
Otter Model (Inference): Visual Inputs + Text Instructions + Multi-modal In-context examples -> OpenFlamingo Architecture -> Response

System Modules

Syphus (Pipeline)

Generate high-quality instruction-response pairs from visual metadata

Model or implementation: ChatGPT / GPT-4

Otter (Model)

Process multi-modal inputs to generate text responses

Model or implementation: OpenFlamingo (customized)

Novel Architectural Elements

Integration of multi-modal in-context information directly into the instruction tuning format (input includes context images/videos, not just context text)
Support for video data treated as ordered sequences of images within the instruction-tuning framework

Modeling

Base Model: OpenFlamingo (based on LLaMA-7B and CLIP ViT-L/14)

Training Method: Instruction Tuning (Supervised Fine-Tuning)

Objective Functions:

Purpose: Maximize likelihood of response given instruction and visual context.

Formally: Standard language modeling loss conditioned on visual inputs.

Training Data:

Total: ~2.8M instruction-response pairs
Unique instructions: 2.2M
Sources: COCO, Spot-the-Diff, ScanNetV2, Visual Storytelling, DenseCaption, TVCaption, Ego4D
Languages: 8 (English, Chinese, Spanish, Japanese, French, German, Korean, Arabic)

Key Hyperparameters:

base_model_size: 7B (LLaMA)
vision_encoder: CLIP ViT-L/14

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaVA: MIMIC-IT supports multiple images and videos, not just single images; MIMIC-IT uses multi-modal in-context examples, LLaVA uses language-only context
vs. MiniGPT-4: MIMIC-IT is significantly larger (2.8M vs 5K/3.5K) and includes diverse scene types (egocentric, indoor, TV shows)
vs. VideoChatGPT: MIMIC-IT covers both image and video domains with specific instruction types for planning and reasoning
+ 1 more
vs. Multi-Instruct [not cited in paper]: Multi-Instruct focuses on diverse existing tasks; MIMIC-IT focuses on generative instructions via self-instruct method

Limitations

Reliance on ChatGPT for data generation can introduce hallucinations or incorrect responses
Video handling is frame-based (image sequences), potentially missing fine-grained temporal dynamics
Evaluation metrics for generative VLMs are still maturing (reliance on ChatGPT/Human eval)

Reproducibility

Code: https://github.com/Luodian/Otter

Available: MIMIC-IT dataset, Syphus pipeline code, Otter model weights, and benchmarks are released on GitHub. Missing: Exact compute hours for training Otter.

📊 Experiments & Results

Evaluation Setup

Multi-modal perception, reasoning, and in-context learning evaluation

Benchmarks:

MMAGIBench (Perception and Reasoning (General scenes))
Multi-Modality Arena (Human evaluation of helpfulness/alignment)
COCO Caption (Few-shot in-context learning)
MSVD/MSRVTT (Video Question Answering and Captioning)

Metrics:

Accuracy (ChatGPT-evaluated)
Elo Rating
CIDEr
Bleu-4
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Otter achieves top performance on MMAGIBench perception and reasoning tasks evaluated by ChatGPT.
MMAGIBench (Perception - Avg)	Accuracy	62.7	65.5	+2.8
MMAGIBench (Reasoning - Avg)	Accuracy	71.9	66.3	-5.6
Human evaluation via Elo rating shows Otter aligning best with user intent.
Multi-Modality Arena	Elo Rating	1013.2	1014.7	+1.5
Video understanding benchmarks demonstrate significant zero-shot improvements.
MSVD 0-shot QA	Accuracy	38.4	45.2	+6.8
MSRVTT 0-shot QA	Accuracy	27.8	35.3	+7.5
Few-shot experiments confirm the benefit of MIMIC-IT training for in-context learning.
COCO Caption (4-shot)	CIDEr	61.5	75.5	+14.0

Experiment Figures

Comparison of Otter against baselines on Video Understanding (a), Human Evaluation (b), and Few-shot Learning (c)

Main Takeaways

MIMIC-IT significantly enhances zero-shot proficiency in multi-modal perception and reasoning compared to baseline OpenFlamingo and other VLMs.
The model successfully generalizes to video tasks (MSVD/MSRVTT) despite being a VLM, due to the multi-image/video data structure in MIMIC-IT.
In-context learning capabilities are robustly improved; the model can effectively use provided visual and textual context to align responses (e.g., style transfer, format adherence).
Egocentric data inclusion (Ego4D) enables specific 'visual assistant' capabilities like planning and safety analysis not found in general VLM datasets.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Instruction Tuning
In-Context Learning (ICL)
Transformer architectures

Key Terms

MIMIC-IT: MultI-Modal In-Context Instruction Tuning—the proposed dataset featuring 2.8M instruction pairs with multi-modal context

Syphus: The automated pipeline proposed in this paper for generating instruction-response pairs using LLMs (ChatGPT/GPT-4) and visual annotations

Otter: The multi-modal model trained on the MIMIC-IT dataset, based on the OpenFlamingo architecture

In-context learning: The ability of a model to learn a task from a few examples provided in the prompt (context) without parameter updates

Egocentric view: First-person perspective (like looking through someone's eyes), crucial for AR/VR applications

Cold-start: A strategy in the Syphus pipeline where initial in-context examples are manually curated or heuristically generated to guide the LLM before large-scale generation

Hallucination: When a model generates plausible but incorrect or factually baseless information

Elo rating: A comparative ranking system used here to evaluate model performance based on pairwise comparisons of responses