MM-LLMs: Recent Advances in MultiModal Large Language Models

📝 Paper Summary

MultiModal Large Language Models (MM-LLMs) Model Architecture Survey

This survey establishes a comprehensive taxonomy of 126 MM-LLMs and defines a general five-component architecture to organize and facilitate research in the rapidly expanding field of multimodal large language models.

Core Problem

As models and datasets have scaled up, training traditional MM models from scratch has become computationally expensive. Meanwhile, the fast-growing body of MM-LLMs lacks a unified architectural framework for organizing its heterogeneous approaches.

Why it matters:

Training multimodal models from scratch incurs substantial computational costs
Effectively connecting pre-trained LLMs with other modalities (audio, vision) remains a core challenge
The lack of a unified taxonomy hinders researchers from tracking the timeline and specific formulations of the exploding number of new models

Concrete Example: Traditional MM models require massive compute; MM-LLMs mitigate this by keeping the heavy LLM backbone frozen or using PEFT (Parameter-Efficient Fine-Tuning), training only lightweight projectors (taking up ~2% of parameters) to align modalities.
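
The arithmetic behind the ~2% figure can be sketched as follows. The component sizes are hypothetical round numbers chosen only for illustration (they are not taken from the survey), and the function name trainable_fraction is likewise an assumption.

```python
def trainable_fraction(param_counts, trainable):
    """Return the share of all parameters that receive gradient updates."""
    total = sum(param_counts.values())
    trained = sum(param_counts[name] for name in trainable)
    return trained / total


# Hypothetical component sizes (parameter counts), for illustration only.
param_counts = {
    "modality_encoder": 300_000_000,      # frozen
    "input_projector": 100_000_000,       # trained
    "llm_backbone": 7_000_000_000,        # frozen
    "output_projector": 50_000_000,       # trained
    "modality_generator": 1_000_000_000,  # frozen
}

fraction = trainable_fraction(
    param_counts, ["input_projector", "output_projector"]
)
print(f"Trainable share: {fraction:.2%}")  # about 1.78% for these made-up sizes
```

With these assumed sizes, only the two projectors are updated, so the trainable share stays close to the ~2% quoted above even though the full system has several billion parameters.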

Key Novelty

Unified General Architecture & Taxonomy for MM-LLMs

Proposes a general model architecture consisting of five components: Modality Encoder, Input Projector, LLM Backbone, Output Projector, and Modality Generator
Establishes a taxonomy encompassing 126 State-of-the-Art MM-LLMs, categorizing them by their specific architectural formulations and capabilities (e.g., MM Understanding vs. MM Generation)

Architecture

The general model architecture of MM-LLMs showing the five core components and the data flow between them.

Evaluation Highlights

Taxonomy categorizes 126 distinct MM-LLMs (e.g., BLIP-2, LLaVA, GPT-4 Vision)
Identifies a standard training pipeline consisting of MM Pre-Training (PT) for alignment and MM Instruction-Tuning (IT) for intent alignment
Defines a modular architecture where trainable parameters are typically around 2% of the total count, enabling cost-effective training

Breakthrough Assessment

9/10

A highly necessary and comprehensive survey that organizes a chaotic field. The definition of the general 5-component architecture provides a standard language for future research.

⚙️ Technical Details

Problem Definition

Setting: Enabling Large Language Models (LLMs) to process and generate multimodal content (image, video, audio, 3D)

Inputs: Multimodal inputs I_X (where X is image, video, audio, etc.) and text t

Outputs: Textual response t and/or multimodal signal tokens S_X to guide generation

Pipeline Flow

Input Processing: Modality Encoder -> Input Projector
Reasoning: LLM Backbone
Output Generation: Output Projector -> Modality Generator
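
The three stages above can be read as a chain of function calls. The sketch below is a minimal illustration, not an implementation from the survey: run_mm_llm and every toy_* function are hypothetical stand-ins, and real components would operate on tensors rather than Python lists and strings.

```python
from typing import Callable, Optional, Tuple


def run_mm_llm(
    raw_input: str,
    text: str,
    encoder: Callable,
    input_projector: Callable,
    llm: Callable,
    output_projector: Optional[Callable] = None,
    generator: Optional[Callable] = None,
) -> Tuple[str, Optional[str]]:
    """Chain the components: encode, project, reason, then optionally generate."""
    features = encoder(raw_input)                 # F_X
    prompts = input_projector(features)           # P_X, aligned with text space
    response, signal_tokens = llm(prompts, text)  # text response and S_X
    can_generate = output_projector is not None and generator is not None
    if signal_tokens and can_generate:
        conditioning = output_projector(signal_tokens)  # H_X
        return response, generator(conditioning)
    return response, None


# Toy stand-ins (assumptions) so the data flow can be executed end to end.
def toy_encoder(raw):
    return [float(len(raw))]

def toy_input_projector(features):
    return [2.0 * value for value in features]

def toy_llm(prompts, text):
    signals = ["[SIG0]"] if "draw" in text else []
    return f"echo: {text}", signals

def toy_output_projector(signals):
    return [float(len(signals))]

def toy_generator(conditioning):
    return f"<generated from {conditioning}>"


understanding = run_mm_llm(
    "pixels", "describe this", toy_encoder, toy_input_projector, toy_llm
)
generation = run_mm_llm(
    "pixels", "draw a cat", toy_encoder, toy_input_projector, toy_llm,
    toy_output_projector, toy_generator,
)
```

The first call omits the output-side components and returns text only; the second passes all five and also returns a generated artifact.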

System Modules

Modality Encoder (ME) (Input Processing)

Encodes raw inputs (image, video, audio) into feature vectors

Model or implementation: Varies (e.g., CLIP ViT, SAM-HQ for images; HuBERT, Whisper for audio; ImageBind for unified inputs)

Input Projector (Input Processing)

Aligns encoded features F_X with the text feature space T

Model or implementation: Linear Projector, MLP, Cross-attention, or Q-Former
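
As a rough illustration of the simplest option, a Linear Projector, the sketch below maps each encoder token from the encoder's width to the LLM's embedding width. It uses plain Python lists in place of tensors; make_linear_projector, the dimensions, and the random initialization are all assumptions made for the example.

```python
import random


def make_linear_projector(in_dim, out_dim, seed=0):
    """Build a toy linear map from encoder width (in_dim) to LLM width (out_dim)."""
    rng = random.Random(seed)
    weight = [
        [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)
    ]
    bias = [0.0] * out_dim

    def project(features):
        # features: one vector of length in_dim per encoded token (F_X)
        return [
            [sum(w * x for w, x in zip(row, token)) + b
             for row, b in zip(weight, bias)]
            for token in features
        ]

    return project


project = make_linear_projector(in_dim=4, out_dim=6)
encoded_tokens = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]  # toy F_X
aligned_prompts = project(encoded_tokens)                       # toy P_X
```

The number of tokens is unchanged; only the width of each token changes so that it matches the LLM's text embeddings. An MLP, cross-attention module, or Q-Former would replace the single matrix with a more expressive mapping.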

LLM Backbone

Processes text and aligned prompts to perform reasoning and decision making

Model or implementation: Off-the-shelf LLMs (e.g., LLaMA, Vicuna, Flan-T5, ChatGLM)

Output Projector (Output Generation)

Maps LLM signal tokens S_X into features H_X for the generator

Model or implementation: Tiny Transformer or MLP

Modality Generator (MG) (Output Generation)

Synthesizes final multimodal output (image, video, audio)

Model or implementation: Latent Diffusion Models (e.g., Stable Diffusion, Zeroscope, AudioLDM-2)

Novel Architectural Elements

Unified five-component architecture generalizing across 126 models
Use of 'Signal Tokens' S_X generated by the LLM to trigger and condition external Modality Generators
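
One way to picture the signal-token mechanism is shown below. This is a sketch under stated assumptions: the token names such as "[IMG0]" are hypothetical, and it assumes the Output Projector consumes the LLM hidden states found at the signal-token positions.

```python
# Hypothetical reserved vocabulary entries that ask for an image to be generated.
SIGNAL_TOKENS = {"[IMG0]", "[IMG1]", "[IMG2]", "[IMG3]"}


def split_signal_tokens(tokens, signal_vocab=SIGNAL_TOKENS):
    """Separate ordinary text tokens from the positions of signal tokens."""
    text_tokens = [tok for tok in tokens if tok not in signal_vocab]
    signal_positions = [i for i, tok in enumerate(tokens) if tok in signal_vocab]
    return text_tokens, signal_positions


def gather_hidden_states(hidden_states, positions):
    """Pick the hidden states at signal positions; these play the role of S_X."""
    return [hidden_states[i] for i in positions]


tokens = ["Here", "is", "a", "cat", "[IMG0]", "[IMG1]"]
hidden_states = [[float(i)] * 2 for i in range(len(tokens))]  # toy LLM outputs

text_tokens, positions = split_signal_tokens(tokens)
s_x = gather_hidden_states(hidden_states, positions)
```

If no signal token appears in the output, positions is empty and the Modality Generator is simply never invoked.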

Modeling

Base Model: Varies (survey covers 126 models including LLaMA, Vicuna, Flan-T5, etc.)

Training Method: Two-stage pipeline: MM Pre-Training (PT) + MM Instruction-Tuning (IT)

Objective Functions:

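Purpose: Align input modality features with the text feature space.

Formally: Minimize the X-conditioned text generation loss L_txt-gen, where the LLM is conditioned on the aligned prompts P_X.

Purpose: Align output features H_X with the generator's conditional space.

Formally: Minimize the conditional LDM loss L_X-gen = E[ || epsilon - epsilon_theta(z_t, t, H_X) ||^2 ], where epsilon is the sampled noise, z_t is the noised latent, and t here denotes the diffusion timestep (not the text t).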
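
For readers who prefer typeset notation, the two objectives can be written roughly as follows. The projector symbols \Theta_{X \to T} and \Theta_{T \to X} and the text features F_T are notation assumed here for illustration; the remaining symbols follow the definitions in this section.

```latex
% Input-side alignment: project encoder features, then minimize the text loss
P_X = \Theta_{X \to T}(F_X), \qquad
\min_{\Theta_{X \to T}} \;
  \mathcal{L}_{\text{txt-gen}}\bigl(\mathrm{LLM}(P_X, F_T)\bigr)

% Output-side alignment: noise-prediction loss of the latent diffusion generator
H_X = \Theta_{T \to X}(S_X), \qquad
\mathcal{L}_{X\text{-gen}} =
  \mathbb{E}\Bigl[\, \bigl\| \epsilon - \epsilon_\theta(z_t, t, H_X) \bigr\|^2 \Bigr]
```

In the second expression, t is the diffusion timestep and z_t is the latent after t noising steps, so H_X enters only as the conditioning signal for the noise predictor \epsilon_\theta.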

Adaptation: PEFT (Parameter-Efficient Fine-Tuning) methods such as LoRA and Prefix-tuning, which often update fewer than 0.1% of the LLM's parameters
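
To see why LoRA can stay below 0.1%, recall that it replaces the update of a d_out x d_in weight matrix with two low-rank factors of rank r, adding r * (d_in + d_out) trainable parameters per adapted matrix. The sketch below works through that arithmetic; the model size, hidden width, rank, and number of adapted matrices are hypothetical values chosen for illustration.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


# Hypothetical setup: a 7B-parameter LLM with hidden size 4096 and 32 layers,
# adapting two square projection matrices per layer with rank 8.
hidden_size = 4096
rank = 8
adapted_matrices = 32 * 2
total_llm_params = 7_000_000_000

per_matrix = lora_param_count(hidden_size, hidden_size, rank)
lora_total = per_matrix * adapted_matrices
lora_share = lora_total / total_llm_params

print(f"LoRA parameters per matrix: {per_matrix:,}")
print(f"Share of LLM parameters: {lora_share:.4%}")
```

Under these assumptions the adapters add about 4.2 million parameters, roughly 0.06% of the backbone, which is in line with the "<0.1%" order of magnitude quoted above.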

Trainable Parameters: Typically ~2% of total parameters (Input/Output Projectors are lightweight)

Training Data:

X-Text datasets {I_X, t} for alignment
Instruction-following datasets for IT

Compute: Significantly reduced compared to training from scratch; trainable parameters often limited to lightweight projectors.

Comparison to Prior Work

vs. End-to-end training (e.g., Flamingo): MM-LLMs typically freeze the heavy LLM backbone and train only lightweight projectors, reducing cost [not cited in paper]
vs. Tool-augmented LLMs (e.g., Visual-ChatGPT): End-to-end MM-LLMs (like NExT-GPT) mitigate propagated errors inherent in cascading systems by using differentiable projectors

Limitations

Survey relies on the state of the field as of early 2024; rapid progress may outdate the taxonomy quickly (addressed via website)
Benchmark results are reviewed, but they come from the heterogeneous evaluation datasets and metrics used by the original authors, which limits direct comparison between models
Specific quantitative results for all 126 models are not provided in the summary text snippet

Reproducibility

Code: https://mm-llms.github.io

The authors maintain a project website hosted on GitHub Pages (https://mm-llms.github.io) for real-time tracking of the surveyed models. Code for the 126 surveyed models is generally available via the original papers referenced in the taxonomy.

📊 Experiments & Results

Evaluation Setup

Review of performance on mainstream benchmarks for MM Understanding and Generation

Benchmarks:

MME (Multimodal Evaluation)
MMBench (Multimodal Benchmark)
MMMU (Massive Multi-discipline Multimodal Understanding)

Metrics:

Varies by task (Accuracy, CIDEr, CLIP Score, etc.)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The field has shifted from MM understanding (e.g., BLIP-2, LLaVA) to MM generation (e.g., MiniGPT-5) and finally to any-to-any modality conversion (e.g., NExT-GPT).
Model architectures follow the 5-component design: Encoder -> Projector -> LLM -> Projector -> Generator. Models that focus only on MM understanding use just the first three components.
Training consistently relies on a two-stage pipeline: Pre-Training for feature alignment and Instruction-Tuning for user intent alignment.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture
Large Language Models (LLMs)
Multimodal learning (Vision-Language models)
Diffusion Models (for generation)

Key Terms

MM-LLMs: MultiModal Large Language Models—models that extend LLMs to support inputs or outputs of other modalities (images, audio) alongside text

PEFT: Parameter-Efficient Fine-Tuning—techniques like LoRA or Prefix-tuning that fine-tune only a small subset of parameters to reduce computational cost

ICL: In-Context Learning—the ability of a model to perform tasks based on examples provided in the prompt without parameter updates

Modality Encoder: A component that encodes inputs from diverse modalities (images, audio) into feature representations

Input Projector: A module that aligns encoded features from other modalities with the text feature space of the LLM

Output Projector: A module that maps LLM signal tokens into features understandable by the Modality Generator

Modality Generator: A component (often a diffusion model) that synthesizes content in distinct modalities based on features from the Output Projector

LDM: Latent Diffusion Model—a type of generative model used for synthesizing high-quality images or audio

Q-Former: A transformer-based input projector that extracts relevant features from encoded inputs using learnable queries (used in BLIP-2)

MM PT: MultiModal Pre-Training—the first training stage focused on aligning modality features

MM IT: MultiModal Instruction-Tuning—the second training stage focused on aligning the model with human intent