
MMInstruct: a high-quality multi-modal instruction tuning dataset with extensive diversity

Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, Yu Qiao, Jifeng Dai
Tsinghua University
Science China Information Sciences (2024)
MM Benchmark Factuality

📝 Paper Summary

Visual Instruction Tuning, Multi-Modal Dataset Construction
MMInstruct is a high-quality, diverse visual instruction tuning dataset built with a semi-automatic data engine that uses GPT-4V for detailed image captioning and GPT-3.5 for instruction synthesis. Fine-tuning Vision Large Language Models (VLLMs) on it improves their benchmark performance.
Core Problem
Existing visual instruction tuning datasets suffer from limited image diversity (often restricted to COCO), poor annotation quality causing hallucinations, and a narrow range of instruction types.
Why it matters:
  • Models trained on limited scenes (e.g., COCO) struggle to generalize to real-world scenarios like text-oriented OCR images.
  • Data generation pipelines relying on rudimentary annotations or weak seed questions introduce noise and hallucinations into VLLMs.
  • Manually constructing diverse, high-quality datasets is prohibitively expensive at scale.
Concrete Example: Models trained on standard datasets struggle to process text-oriented OCR images because the underlying training images lack text diversity. Furthermore, instructions generated from simple bounding box annotations often contain hallucinated details that are not present in the image.
Key Novelty
Semi-Automatic Instruction Generation Data Engine
  • Replaces rudimentary image annotations with detailed, domain-specific semantic captions generated by GPT-4V to ground instruction generation.
  • Utilizes a 'seed question' strategy where experts design domain-specific templates that serve as references, encouraging GPT-3.5 to generate diverse instruction-answer pairs.
  • Combines automated generation with manual correction to ensure quality while reducing cost to one-sixth that of fully manual annotation.
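The three stages above (detailed captioning, seed-guided instruction generation, manual correction) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `InstructionPair`, `build_generation_prompt`, and `run_engine` are hypothetical, the prompt wording is invented, and the GPT-4V captioner, GPT-3.5 generator, and human reviewer are replaced by stub functions so the sketch runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class InstructionPair:
    question: str
    answer: str
    verified: bool = False  # set to True once a reviewer has accepted the pair


def build_generation_prompt(caption: str, seed_questions: List[str], n_pairs: int) -> str:
    """Combine the detailed caption with expert seed questions into one prompt.

    The wording here is an assumption; the source does not give the prompt text.
    """
    seeds = "\n".join(f"- {q}" for q in seed_questions)
    return (
        f"Image description:\n{caption}\n\n"
        f"Reference questions for this domain:\n{seeds}\n\n"
        f"Write {n_pairs} new question-answer pairs that are answerable "
        f"from the description alone."
    )


def run_engine(
    image_id: str,
    domain: str,
    seed_questions: List[str],
    captioner: Callable[[str, str], str],
    generator: Callable[[str], List[InstructionPair]],
    reviewer: Callable[[str, InstructionPair], Optional[InstructionPair]],
    n_pairs: int = 3,
) -> List[InstructionPair]:
    # Stage 1: a detailed, domain-specific caption grounds everything that follows.
    caption = captioner(image_id, domain)
    # Stage 2: seed questions act as references for the text-only generator.
    drafts = generator(build_generation_prompt(caption, seed_questions, n_pairs))
    # Stage 3: manual correction; the reviewer may fix a pair or reject it (None).
    kept = []
    for draft in drafts:
        fixed = reviewer(caption, draft)
        if fixed is not None:
            fixed.verified = True
            kept.append(fixed)
    return kept


# Stubs standing in for GPT-4V, GPT-3.5, and a human annotator (all hypothetical).
def fake_captioner(image_id: str, domain: str) -> str:
    return f"[{domain}] A receipt showing a total of 12.50."


def fake_generator(prompt: str) -> List[InstructionPair]:
    return [
        InstructionPair("What is the total on the receipt?", "12.50"),
        InstructionPair("What colour is the cashier's shirt?", "Blue"),
    ]


def fake_reviewer(caption: str, pair: InstructionPair) -> Optional[InstructionPair]:
    # Toy rule: reject answers that cannot be found in the caption.
    return pair if pair.answer in caption else None


pairs = run_engine(
    "img_001", "ocr", ["What text appears in the image?"],
    fake_captioner, fake_generator, fake_reviewer,
)
```

In this toy run the second draft is rejected because its answer is not supported by the caption, which mirrors the role the summary assigns to manual correction: filtering out hallucinated details before they reach the training set.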
Evaluation Highlights
  • LLaVA-1.5 fine-tuned on MMInstruct achieves a score of 1626.2 on the MME benchmark, surpassing the baseline LLaVA-1.5 by 94.9 points.
  • On LLaVA-Bench (In-the-Wild), the model scores 74.5, outperforming the LLaVA-1.5 baseline by 3.8 points.
  • Achieves state-of-the-art performance on 10 out of 12 evaluated benchmarks compared to LLaVA-1.5 trained on standard datasets.
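As a quick sanity check, the baseline scores implied by the reported numbers can be recovered by subtracting each gain from the fine-tuned score. The baselines below are derived from the figures in this summary, not quoted from the paper, and the helper name `implied_baseline` is only illustrative.

```python
# Fine-tuned score and reported gain over the LLaVA-1.5 baseline, as stated above.
results = {
    "MME": {"score": 1626.2, "gain": 94.9},
    "LLaVA-Bench (In-the-Wild)": {"score": 74.5, "gain": 3.8},
}


def implied_baseline(score: float, gain: float) -> float:
    """Baseline score implied by a fine-tuned score and its reported gain."""
    # Round to one decimal to match the precision of the reported numbers.
    return round(score - gain, 1)


baselines = {name: implied_baseline(**r) for name, r in results.items()}

for name, base in baselines.items():
    print(f"{name}: implied baseline {base}")
```

This gives an implied LLaVA-1.5 baseline of 1531.3 on MME and 70.7 on LLaVA-Bench (In-the-Wild).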
Breakthrough Assessment
8/10
Significant contribution to data engineering for VLLMs. The cost-effective pipeline addresses key bottlenecks (diversity/hallucination) and yields SOTA results on major benchmarks, though the underlying model architecture remains standard.