
On Domain-Adaptive Post-Training for Multimodal Large Language Models

Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, Zhenliang Zhang
Beijing Institute for General Artificial Intelligence, Beihang University, Tsinghua University, Beijing Institute of Technology, Renmin University of China
arXiv (2024)
MM Pretraining Benchmark

📝 Paper Summary

Multimodal Large Language Models (MLLMs) · Domain Adaptation · Synthetic Data Generation
AdaMLLM adapts general multimodal models to specialized domains by synthesizing high-quality visual instructions using only open-source models and employing a simplified single-stage post-training pipeline.
Core Problem
General MLLMs perform poorly in specialized domains (e.g., biomedicine, remote sensing) due to insufficient domain training data. Existing adaptation methods either rely on closed-source models, which raises privacy concerns when the domain data is sensitive, or use complex two-stage training.
Why it matters:
  • Scientific and industrial fields require expertise on specialized images not found in general web data.
  • Privacy constraints often prohibit sending sensitive domain data to closed-source APIs like GPT-4V for annotation.
  • Two-stage training (image-caption alignment followed by instruction tuning) limits task diversity and reduces efficiency in data-scarce domains.
Concrete Example: In biomedicine, a general MLLM might describe a chest X-ray in broad terms but fail to answer specific diagnostic questions. Current methods either require sending patient data to GPT-4 (a privacy risk) or train in two stages, which separates captioning knowledge from QA reasoning.
Key Novelty
AdaMLLM (Adapted Multimodal Large Language Model)
  • Generate-then-filter pipeline: Fine-tunes an open-source MLLM to synthesize diverse instruction-response pairs from domain image-caption pairs, then filters them with a consistency check between the 'precise' and 'informative' responses generated for each instruction.
  • Single-stage post-training: Combines the original image-captioning task with the synthetic visual instruction task into one training stage, rather than the traditional two-stage separation, to preserve task diversity.
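The two ideas above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `SyntheticPair` record, the helper names, and the fixed captioning prompt are all hypothetical, and the word-level string match in `is_consistent` is only a stand-in for the paper's consistency check. Using the informative response as the training target is likewise an assumption made for illustration.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class SyntheticPair:
    """Hypothetical record for one synthesized instruction (not from the paper)."""
    image_id: str
    instruction: str
    informative: str  # longer, explanatory response
    precise: str      # short, direct answer


def _normalize(text: str) -> str:
    # Lowercase and replace punctuation with spaces so matching is less brittle.
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def is_consistent(pair: SyntheticPair) -> bool:
    # Stand-in check (assumption): the precise answer must appear, as whole
    # words, inside the informative response. The paper's actual filter differs.
    precise = _normalize(pair.precise)
    informative = _normalize(pair.informative)
    return bool(precise) and f" {precise} " in f" {informative} "


def filter_pairs(pairs):
    # Keep only pairs whose two responses agree under the stand-in check.
    return [p for p in pairs if is_consistent(p)]


def build_single_stage_dataset(captions, pairs, seed=0):
    # Mix captioning examples and filtered instruction examples into ONE
    # shuffled training set, instead of training on them in two stages.
    examples = [
        {"image_id": image_id, "prompt": "Describe the image.",
         "response": caption, "task": "caption"}
        for image_id, caption in captions.items()
    ]
    for p in filter_pairs(pairs):
        examples.append({"image_id": p.image_id, "prompt": p.instruction,
                         "response": p.informative, "task": "instruction"})
    random.Random(seed).shuffle(examples)
    return examples
```

In this sketch a pair such as ("Is there an effusion?", informative "Yes, a pleural effusion is visible on the left.", precise "Yes") is kept, while a pair whose precise answer contradicts the informative one is dropped before the two task types are mixed.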
Evaluation Highlights
  • AdaMLLM (8B) outperforms LLaVA-Med (created with GPT-4) on biomedical VQA tasks (e.g., +4.6% on VQA-RAD).
  • Outperforms baselines in the food and remote sensing domains, including baselines that use strong closed-source models such as GPT-4V and GPT-4o.
  • Single-stage training consistently beats two-stage training (e.g., +2.0 average score improvement in Biomedicine) when using high-quality synthetic data.
Breakthrough Assessment
7/10
Strong practical contribution demonstrating that open-source models can generate high-quality synthetic data for domain adaptation, surpassing closed-source baselines. The shift to single-stage training simplifies the standard pipeline effectively.