mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

📝 Paper Summary

Multi-modal Foundation Models · Vision-Language Pre-training · Video-Language Pre-training

mPLUG-2 introduces a modularized architecture that shares universal layers for collaboration while using modality-specific modules to prevent entanglement, enabling flexible unification of text, image, and video tasks.

Core Problem

Existing multi-modal foundation models either use a single network for all modalities, causing interference (modality entanglement), or rely heavily on separate encoders, limiting collaboration.

Why it matters:

Jointly training on diverse modalities (text, image, video) often leads to negative interference where improving one modality degrades another
Rigid architectures cannot flexibly select sub-modules for specific downstream tasks, leading to computational inefficiency during inference
Balancing the gain from modality collaboration against the noise from modality entanglement is difficult in single-module architectures

Concrete Example: A single Transformer trained on both static images and dynamic videos might struggle because spatial patterns in images interfere with the temporal motion patterns required for video understanding. mPLUG-2 solves this by using a shared spatial module but a separate 'local temporal modeling' module specifically for video frames.

Key Novelty

Modularized Multi-modal Foundation Model

Separates the network into modality-specific encoders (handling entanglement) and shared universal layers (handling collaboration) to balance distinctiveness and synergy
Introduces a Dual-vision Encoder that shares spatial processing for images/video but adds specific local temporal modules for video dynamics
Uses a Universal Layers Module as a pivot to project vision and language into a shared semantic space while maintaining original modality features via cross-attention

Architecture

The overall mPLUG-2 framework and detailed schematics of the Dual-vision Encoder and Universal Layers modules.

Evaluation Highlights

Achieves 48.0% Top-1 Accuracy on MSRVTT Video QA, surpassing the previous state-of-the-art (47.4%) by 0.6 percentage points despite a smaller pre-training data scale
Sets new SOTA on MSRVTT Video Captioning with 80.3 CIDEr score, outperforming GIT2 (75.9) and VideoCoCa (73.2)
Matches or exceeds SOTA on ImageNet-1K classification (88.5% Top-1) among general-purpose foundation models without using ImageNet-21K or JFT pre-training data

Breakthrough Assessment

8/10

Significantly advances multi-modal unification by directly addressing the trade-off between modality entanglement and modality collaboration. The modular design delivers SOTA or competitive performance across video, image, and text tasks with a single model.

⚙️ Technical Details

Problem Definition

Setting: Unified pre-training and fine-tuning across uni-modal (text, image, video) and cross-modal (image-text, video-text) tasks

Inputs: Combinations of Image I, Video V, and Text T sequences

Outputs: Task-dependent outputs: Class labels (classification), Generated text (captioning/QA), or Similarity scores (retrieval)

Pipeline Flow

Modality-Specific Encoders (Text, Image/Video)
Universal Layers (Modality Alignment)
Fusion Module (Cross-modal Interaction)
Shared Decoder (Generation)

System Modules

Dual-vision Encoder (Encoding)

Extract visual features from images or video frames

Model or implementation: Transformer with disentangled spatial/temporal blocks
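
The split between shared spatial processing and video-only temporal processing can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `spatial_block` (mean-centering) stands in for the shared spatial Transformer layers, `local_temporal_block` (a moving average over neighbouring frames) stands in for the local temporal module, and the order in which the two are applied is a guess rather than something this summary specifies.

```python
from typing import List

Frame = List[float]  # toy per-frame feature vector (hypothetical representation)


def spatial_block(frame: Frame) -> Frame:
    """Shared by images and video frames; stand-in for spatial self-attention."""
    mean = sum(frame) / len(frame)
    return [x - mean for x in frame]


def local_temporal_block(frames: List[Frame], window: int = 1) -> List[Frame]:
    """Video only: average each frame with neighbours at most `window` steps away."""
    mixed = []
    for t in range(len(frames)):
        lo, hi = max(0, t - window), min(len(frames), t + window + 1)
        neighbours = frames[lo:hi]
        mixed.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
    return mixed


def dual_vision_encode(frames: List[Frame], is_video: bool) -> List[Frame]:
    """Images skip the temporal path; both modalities share the spatial path."""
    if is_video:
        frames = local_temporal_block(frames)
    return [spatial_block(f) for f in frames]


image_features = dual_vision_encode([[1.0, 3.0]], is_video=False)
video_features = dual_vision_encode([[0.0, 2.0], [2.0, 4.0]], is_video=True)
```

An image is passed as a single-frame list, so the same spatial code path serves both inputs and only the `is_video` flag decides whether the temporal block runs.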

Text Encoder (Encoding)

Extract contextual text representations

Model or implementation: BERT-style encoder

Universal Layers Module

Project vision and text into shared semantic space while preserving original features

Model or implementation: Transformer layers with self-attention and cross-attention back to original features
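
A minimal sketch of one such layer, assuming single-head scaled dot-product attention with residual connections and no learned projections, layer norm, or feed-forward sublayer. The function names (`attend`, `universal_layer`) are hypothetical, and the real layer almost certainly differs in detail; the point is only the data flow: self-attention over the current hidden states, then cross-attention whose keys and values are the original modality features.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def universal_layer(hidden: Matrix, original: Matrix) -> Matrix:
    # 1) Self-attention inside the shared semantic space (residual).
    hidden = add(hidden, attend(hidden, hidden, hidden))
    # 2) Cross-attention back to the untouched modality-specific features (residual).
    return add(hidden, attend(hidden, original, original))


out = universal_layer([[1.0, 0.0]], [[2.0, 3.0]])
```

Because `original` is never overwritten, every stacked layer can re-read the modality-specific features, which is one way to read "preserving original features".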

Fusion Module

Deeply fuse text and vision representations for understanding tasks

Model or implementation: Stacked Transformer blocks with Cross-Attention

Shared Decoder

Generate text for captioning, QA, and summarization

Model or implementation: Transformer Decoder

Novel Architectural Elements

Dual-vision Encoder: Disentangles spatial (shared) and temporal (video-specific) representations within a single encoder block structure
Universal Layers Module: A shared pivot module that connects modality-specific encoders to fusion/decoding modules, allowing flexible module selection during inference
Modular composition: The architecture explicitly allows selecting different subsets of modules (e.g., bypassing Fusion for uni-modal tasks) based on input type
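
The module-selection idea can be sketched as a small routing function. The module names and the exact routing rules below are assumptions for illustration; the only rule taken from the text is that the Fusion module is bypassed for uni-modal inputs.

```python
from typing import List


def select_modules(has_text: bool, has_vision: bool, generate: bool) -> List[str]:
    """Return the ordered (hypothetical) module names to run for one input."""
    if not (has_text or has_vision):
        raise ValueError("at least one modality is required")
    modules: List[str] = []
    if has_text:
        modules.append("text_encoder")
    if has_vision:
        modules.append("dual_vision_encoder")
    # The universal layers act as the shared pivot for every route.
    modules.append("universal_layers")
    # Fusion only makes sense when both modalities are present.
    if has_text and has_vision:
        modules.append("fusion")
    if generate:
        modules.append("shared_decoder")
    return modules


text_classification = select_modules(has_text=True, has_vision=False, generate=False)
video_qa = select_modules(has_text=True, has_vision=True, generate=True)
```

Under these assumed rules a text-only understanding task touches two modules, while a generative video QA task touches all five.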

Modeling

Base Model: mPLUG-2 (Custom Transformer architecture)

Training Method: Joint Pre-training with Instruction-based Learning

Objective Functions:

Purpose: Learn text representation.
Formally: Masked Language Modeling (MLM)

Purpose: Align vision and language.
Formally: Cross-modal Matching Losses (CML) including Vision-Language Matching (VLM) and Contrastive Learning (VLC)

Purpose: Unify generation tasks.
Formally: Instruction-based Language Model Loss (generating text based on task instructions)
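
To make the contrastive (VLC) objective concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired embeddings. The symmetric form, the `temperature` value, and the assumption that embeddings are already L2-normalized are standard choices borrowed from common contrastive setups, not details confirmed by this summary.

```python
import math
from typing import List

Matrix = List[List[float]]


def _diagonal_nll(sim: Matrix) -> float:
    """Mean of -log softmax(sim[i])[i]: row i's positive sits in column i."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)


def contrastive_loss(vision: Matrix, text: Matrix, temperature: float = 0.07) -> float:
    """Symmetric vision-to-text and text-to-vision loss (assumed form)."""
    sim = [[sum(a * b for a, b in zip(v, t)) / temperature for t in text]
           for v in vision]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-vision similarities
    return 0.5 * (_diagonal_nll(sim) + _diagonal_nll(sim_t))


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings give a loss near zero, while swapping the pairs makes every positive the least similar candidate and drives the loss up.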

Training Data:

Image-Text: MS COCO, Visual Genome, Conceptual Captions (3M/12M), SBU Captions (Total ~14M images)
Video-Text: WebVid-2M (2.5M pairs)
Text-only: WikiCorpus (20GB), Common Crawl (350GB)

Key Hyperparameters:

pre_training_data_scale: 17M image/video-text pairs
model_size_parameter_count: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Flamingo: mPLUG-2 is modular and handles video/image/text in one architecture, whereas Flamingo focuses on vision-to-text generation
vs. BEiT-3: mPLUG-2 uses specific modules for entanglement issues rather than a single massive shared transformer
vs. OFA: mPLUG-2 incorporates a Dual-vision encoder and Universal layers for better modality collaboration compared to OFA's rigid sequence-to-sequence framework
vs. HiTeA: mPLUG-2 outperforms HiTeA on video tasks with similar data scale due to the dual-vision encoder's local temporal modeling

Limitations

Computational cost of pre-training not explicitly analyzed
Model size parameters not explicitly listed in the main text comparison tables
Relies on standard public datasets; performance on highly domain-specific data is untested

Reproducibility

Code: https://github.com/alibaba/AliceMind

Code and models are publicly available at https://github.com/alibaba/AliceMind. Pre-training datasets are standard public datasets (COCO, VG, CC, WebVid, etc.). Pre-training details such as learning rate, batch size, and GPU hours are not explicitly provided in the main text.

📊 Experiments & Results

Evaluation Setup

Pre-training followed by fine-tuning on over 30 downstream tasks

Benchmarks:

MSRVTT / MSVD / TGIF-FrameQA (Video QA and Retrieval)
MSCOCO / Flickr30k (Image-Text Retrieval and Captioning)
VQA v2 (Visual Question Answering)
GLUE (Natural Language Understanding)
ImageNet-1K / Kinetics-400 (Vision-only Classification)

Metrics:

Top-1 Accuracy
Recall@1/5/10
CIDEr
BLEU@4
ROUGE-L

Statistical methodology: Not explicitly reported in the paper
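
For readers unfamiliar with the Recall@K retrieval metric listed above, a minimal sketch follows. It assumes a square similarity matrix in which query `i`'s correct match is candidate `i`; real benchmarks with several captions per image or video need a mapping from queries to ground-truth indices instead.

```python
from typing import List


def recall_at_k(similarity: List[List[float]], k: int) -> float:
    """Fraction of queries whose ground-truth candidate ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


sim = [
    [0.9, 0.1, 0.0],  # query 0: correct candidate ranked first
    [0.8, 0.2, 0.1],  # query 1: correct candidate ranked second
    [0.1, 0.2, 0.7],  # query 2: correct candidate ranked first
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```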

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Video Understanding Results: mPLUG-2 demonstrates significant gains in video QA and captioning, setting new state-of-the-art results even against larger models.
MSRVTT-QA	Top-1 Accuracy	47.4	48.0	+0.6
MSRVTT Caption	CIDEr	75.9	80.3	+4.4
LSMDC Retrieval	Recall@1	28.7	34.4	+5.7
Image-Text Results: mPLUG-2 remains competitive or superior on established image-text benchmarks.
COCO Caption	CIDEr	136.7	137.7	+1.0
VQA v2	test-std Accuracy	80.50	81.13	+0.63
Vision-Only and Language-Only Results: The model generalizes well to uni-modal tasks without losing performance.
ImageNet-1K	Top-1 Accuracy	87.8	88.5	+0.7
GLUE (Average)	Score	92.6	92.7	+0.1
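
The Δ column is the difference between the "This Paper" and "Baseline" columns. A quick consistency check over the rows above, with values copied from the table and rounded to two decimals to absorb floating-point noise:

```python
# (benchmark, baseline, this paper, reported delta), copied from the table above
ROWS = [
    ("MSRVTT-QA", 47.4, 48.0, 0.6),
    ("MSRVTT Caption", 75.9, 80.3, 4.4),
    ("LSMDC Retrieval", 28.7, 34.4, 5.7),
    ("COCO Caption", 136.7, 137.7, 1.0),
    ("VQA v2", 80.50, 81.13, 0.63),
    ("ImageNet-1K", 87.8, 88.5, 0.7),
    ("GLUE (Average)", 92.6, 92.7, 0.1),
]


def delta(baseline: float, ours: float, ndigits: int = 2) -> float:
    """Absolute improvement in metric points, rounded for display."""
    return round(ours - baseline, ndigits)


mismatches = [name for name, base, ours, reported in ROWS
              if delta(base, ours) != reported]
```

All deltas are absolute differences in metric points, not relative percentages.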

Main Takeaways

Modular design allows state-of-the-art performance across disparate modalities (text, image, video) simultaneously without performance degradation from negative transfer
Dual-vision encoder effectively captures temporal dynamics in video, evidenced by large gains in video retrieval and captioning tasks
Universal layers facilitate strong zero-shot and few-shot transfer by aligning modalities in a shared semantic space
Data efficiency is high; mPLUG-2 outperforms models trained on billions of samples (like Flamingo, GIT2) using only ~17M pairs

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention, Cross-Attention)
Vision-Language Pre-training (VLP) objectives
BERT-style masked language modeling

Key Terms

Modality Entanglement: The phenomenon where training on multiple modalities (e.g., text and video) causes interference that degrades performance on individual tasks

Universal Layers: Network layers shared across all tasks that project different modalities into a common semantic space for alignment

Dual-vision Encoder: A vision encoder architecture that processes both images and videos by sharing the spatial layers across both while adding separate temporal layers used only for video

CIDEr: Consensus-based Image Description Evaluation, a metric for image/video captioning quality that scores a generated sentence by its TF-IDF-weighted n-gram similarity to a set of human reference captions

Instruction-based Learning: Using explicit text instructions (e.g., 'generate caption') to guide the model to perform specific tasks using the same weights
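
As a rough illustration of instruction-based learning, the sketch below prepends a task instruction to the text input so that one set of weights can be steered toward different tasks. The instruction strings other than 'generate caption', the `INSTRUCTIONS` table, and the `build_prompt` helper are all hypothetical; the actual prompt format is not given in this summary.

```python
# Hypothetical task-to-instruction table; only 'generate caption' appears in the text.
INSTRUCTIONS = {
    "caption": "generate caption",
    "vqa": "answer the question",
}


def build_prompt(task: str, text: str = "") -> str:
    """Prefix the optional text input with the instruction for `task`."""
    if task not in INSTRUCTIONS:
        raise KeyError(f"no instruction registered for task {task!r}")
    instruction = INSTRUCTIONS[task]
    return f"{instruction}: {text}" if text else instruction


caption_prompt = build_prompt("caption")
vqa_prompt = build_prompt("vqa", "what color is the car?")
```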