An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Parameter-Efficient Fine-Tuning (PEFT), Vision-Language Alignment

This empirical study evaluates four PEFT methods across three MLLMs, finding that Adapter generally outperforms LoRA, IA3, and Prefix-Tuning in stability, generalization, and hallucination reduction, especially when connector layers are also fine-tuned.

Core Problem

Full fine-tuning of Multimodal LLMs is computationally prohibitive, but the optimal strategy for applying Parameter-Efficient Fine-Tuning (PEFT) to MLLMs—specifically regarding connector layers, module location, and data scale—remains unclear.

Why it matters:

MLLMs introduce visual encoders and connector layers absent in standard LLMs, complicating the transfer of existing PEFT best practices
Blindly applying LLM fine-tuning strategies to multimodal tasks often leads to suboptimal performance on unseen datasets or catastrophic forgetting on seen datasets
Hallucination remains a critical issue in MLLMs, and different fine-tuning methods impact model faithfulness differently

Concrete Example: When fine-tuning Qwen-VL-Chat on a seen dataset like OKVQA, tuning the connector layers causes a significant performance deterioration compared to freezing them, whereas on unseen datasets, tuning the connector usually improves results.

Key Novelty

Comprehensive Empirical Benchmarking of MLLM PEFT

Systematically isolates the impact of fine-tuning connector layers (the bridge between vision and text) versus freezing them across different PEFT methods
Evaluates the trade-off between model stability and trainable parameter count, revealing that fewer parameters do not always guarantee stability in multimodal settings
Investigates the correlation between specific PEFT methods and the rate of hallucination in downstream multimodal tasks
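The connector comparison comes down to which parameters receive gradients. A minimal sketch of that selection follows, assuming parameters are identified by name. The `mm_projector` prefix, the `lora_` marker, and every parameter name in the example are hypothetical stand-ins, not identifiers confirmed by the paper.

```python
def select_trainable(param_names, tune_connector,
                     peft_marker="lora_", connector_prefix="mm_projector"):
    """Return the parameter names that stay trainable.

    PEFT parameters are always trainable; connector parameters are trainable
    only when tune_connector is True. Everything else stays frozen.
    """
    trainable = set()
    for name in param_names:
        is_peft = peft_marker in name
        is_connector = name.startswith(connector_prefix)
        if is_peft or (tune_connector and is_connector):
            trainable.add(name)
    return trainable


# Hypothetical parameter names, for illustration only.
names = [
    "vision_tower.blocks.0.attn.qkv.weight",
    "mm_projector.0.weight",
    "mm_projector.2.weight",
    "llm.layers.0.self_attn.q_proj.weight",
    "llm.layers.0.self_attn.q_proj.lora_A.weight",
    "llm.layers.0.self_attn.q_proj.lora_B.weight",
]

frozen_connector = select_trainable(names, tune_connector=False)
tuned_connector = select_trainable(names, tune_connector=True)
print(sorted(tuned_connector - frozen_connector))
```

The difference between the two sets is exactly the connector weights, which is the variable the study isolates.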

Architecture

The architecture of MLLMs and the specific insertion points for LoRA, IA3, Adapter, and Prefix-Tuning.
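To make the insertion points concrete, here is a minimal pure-Python sketch of how three of the four methods change a layer's computation, assuming the standard formulations of LoRA, IA3, and bottleneck adapters. The toy matrices, the `alpha` scaling value, and all function names are illustrative assumptions; Prefix-Tuning (which prepends trainable vectors to the attention keys and values) is not sketched.

```python
def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(W, A, B, x, alpha):
    """LoRA: frozen W plus a scaled low-rank update, W x + (alpha / r) * B (A x)."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def ia3_scale(activation, learned_scale):
    """IA3: rescale an activation element-wise with a learned vector."""
    return [a * s for a, s in zip(activation, learned_scale)]


def adapter(h, W_down, W_up):
    """Bottleneck adapter with a residual connection: h + W_up relu(W_down h)."""
    hidden = [max(0.0, z) for z in matvec(W_down, h)]
    return [hi + ui for hi, ui in zip(h, matvec(W_up, hidden))]


# Toy frozen weight (identity) and input; all values are illustrative.
W = [[1.0, 0.0], [0.0, 1.0]]
x = [2.0, 3.0]

# Rank-1 LoRA. With B = 0 the layer output equals the frozen layer's output.
A = [[1.0, 1.0]]
lora_at_init = lora_linear(W, A, [[0.0], [0.0]], x, alpha=2.0)
lora_trained = lora_linear(W, A, [[1.0], [0.0]], x, alpha=2.0)

# IA3 with a vector of ones leaves the activation unchanged.
ia3_at_init = ia3_scale(x, [1.0, 1.0])

# Adapter with W_up = 0 reduces to the identity through the residual path.
adapter_at_init = adapter(x, [[1.0, 1.0]], [[0.0], [0.0]])
adapter_trained = adapter(x, [[1.0, 1.0]], [[0.5], [0.0]])
```

In each case the "at init" variant reproduces the frozen model's output, which is why these modules can be attached without disturbing the pre-trained behaviour before training starts.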

Evaluation Highlights

On the Flickr30k benchmark, Adapter achieves the lowest average hallucination rate (13.3%), whereas Prefix-Tuning increases hallucinations by ~24%
Fine-tuning connector layers with IA3 yields a ~15.0% average performance increase on unseen datasets compared to freezing the connector
LoRA requires the 'Both' placement (Attention + MLP layers) to match the performance that Adapter achieves using only the 'MLP' layer placement on LLaVA-1.5-7B
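The placement options in the module-location ablation can be pictured as a choice of which sub-layers receive a PEFT module. The sketch below assumes LLaMA-style projection names (`q_proj`, `up_proj`, and so on), as used by Vicuna-based models. Which exact sub-modules the paper attaches to under each option is an assumption here, and `target_modules` is a hypothetical helper.

```python
# Assumed LLaMA-style sub-module names; the paper's exact targets may differ.
ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_MODULES = ["gate_proj", "up_proj", "down_proj"]


def target_modules(placement):
    """Map a placement option ('Attention', 'MLP', or 'Both') to module names."""
    key = placement.lower()
    if key == "attention":
        return list(ATTENTION_MODULES)
    if key == "mlp":
        return list(MLP_MODULES)
    if key == "both":
        return ATTENTION_MODULES + MLP_MODULES
    raise ValueError(f"unknown placement: {placement!r}")


for option in ("Attention", "MLP", "Both"):
    print(option, target_modules(option))
```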

Breakthrough Assessment

4/10

A solid empirical study that establishes best practices (e.g., use Adapters for stability/hallucination) but does not propose a novel architecture or algorithm. Valuable for practitioners.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Instruction Tuning where a pre-trained MLLM is fine-tuned on vision-language tasks using limited trainable parameters

Inputs: Multimodal inputs consisting of an image $I$ and a text instruction $T$

Outputs: Textual response $R$ generated by the LLM

Pipeline Flow

Visual Encoder (processes image)
Connector Layers (projects visual features)
LLM with PEFT Modules (processes text + visual features)
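The three stages above can be sketched as a toy data flow. This is a shape-level illustration only: the encoder is a random stub, the connector is a single linear projection (the actual connectors may be deeper), and placing visual tokens before the text tokens is an assumption. All names and dimensions are hypothetical.

```python
import random


def visual_encoder(image_id, num_patches=4, feat_dim=6):
    """Stand-in for a frozen vision encoder: one feature vector per image patch."""
    rng = random.Random(image_id)
    return [[rng.uniform(-1.0, 1.0) for _ in range(feat_dim)]
            for _ in range(num_patches)]


def connector(features, weight):
    """Project patch features (feat_dim) into the LLM embedding space (embed_dim)."""
    columns = list(zip(*weight))  # weight has shape feat_dim x embed_dim
    return [[sum(f * w for f, w in zip(feat, col)) for col in columns]
            for feat in features]


def build_llm_input(visual_tokens, text_embeddings):
    """Concatenate projected visual tokens with the instruction's text embeddings."""
    return visual_tokens + text_embeddings


FEAT_DIM, EMBED_DIM = 6, 8
rng = random.Random(0)
projection = [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
              for _ in range(FEAT_DIM)]
text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
                   for _ in range(3)]

patch_features = visual_encoder(image_id=7)
visual_tokens = connector(patch_features, projection)
llm_input = build_llm_input(visual_tokens, text_embeddings)
print(len(llm_input), len(llm_input[0]))  # sequence length, embedding width
```

The LLM then consumes `llm_input` as one sequence, so the connector is the only component that decides how visual information is presented to the language model.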

System Modules

Visual Encoder

Extract visual features from input images

Model or implementation: CLIP-ViT-L/336px (for LLaVA) or ViT-BigG (for Qwen)

Connector Layers

Align visual features with the LLM's embedding space

Model or implementation: MLP Projection (LLaVA/ShareGPT4V) or specific connector for Qwen

Large Language Model

Generate text response based on instructions and visual tokens

Model or implementation: Vicuna-v1.5 (7B/13B) or Qwen-7B

Modeling

Base Model: LLaVA-1.5 (7B, 13B), ShareGPT4V (7B), Qwen-VL-Chat (7B)

Training Method: Supervised Fine-Tuning (SFT) with PEFT

Objective Functions:

Purpose: Standard Causal Language Modeling loss.

Formally: Minimize negative log-likelihood of the next token given context.
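
Using the symbols defined earlier (image $I$, instruction $T$, response $R$), this objective can be written out as below. The token index $t$, the response tokens $r_t$, and the split into frozen parameters $\theta$ and trainable parameters $\phi$ (the PEFT modules and, when tuned, the connector) are notation assumed here rather than taken from the paper.

$$
\mathcal{L}(\phi) = -\sum_{t=1}^{|R|} \log p_{\theta,\phi}\left(r_t \mid I, T, r_{<t}\right)
$$

Only $\phi$ is updated during training; $\theta$ stays fixed at its pre-trained values.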

Adaptation: Compared LoRA, IA3, Adapter, and Prefix-Tuning

Trainable Parameters: Varied by method (e.g., LoRA rank, Adapter bottleneck size)
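
How the trainable-parameter count depends on these settings can be sketched with the standard formulas for each method. The dimensions below (hidden size 4096, feed-forward size 11008) are those of a LLaMA-7B-style layer, and the rank and bottleneck values are illustrative choices, not the settings reported in the paper.

```python
def lora_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted weight matrix."""
    return rank * (d_in + d_out)


def adapter_params(d_model, bottleneck, use_bias=True):
    """A bottleneck adapter adds a down-projection and an up-projection."""
    weights = 2 * d_model * bottleneck
    biases = (bottleneck + d_model) if use_bias else 0
    return weights + biases


def ia3_params(d_key, d_value, d_ff):
    """IA3 adds one scaling vector each for keys, values, and FFN activations."""
    return d_key + d_value + d_ff


D_MODEL, D_FF = 4096, 11008  # assumed LLaMA-7B-style dimensions

per_matrix_lora = lora_params(D_MODEL, D_MODEL, rank=8)
per_module_adapter = adapter_params(D_MODEL, bottleneck=64)
per_layer_ia3 = ia3_params(D_MODEL, D_MODEL, D_FF)
print(per_matrix_lora, per_module_adapter, per_layer_ia3)
```

Under these assumptions IA3 is the lightest by a wide margin, while LoRA and Adapter scale linearly with rank and bottleneck size respectively.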

Training Data:

Unseen: ScienceQA, VizWiz, IconQA, Flickr30k
Seen: OKVQA, OCRVQA, VQAv2

Key Hyperparameters:

global_batch_size: 128
epochs: 3
LoRA_learning_rate: 2e-4
Adapter_learning_rate: 5e-5
IA3_learning_rate: 2e-4
Prefix_Tuning_learning_rate: 1e-5
seed: 42
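
For reference, the listed hyperparameters can be collected into a single configuration. The dictionary layout and the `per_device_batch_size` helper are hypothetical; since compute is not reported, the device counts in the usage lines are placeholders chosen only to show how a global batch size of 128 could be split.

```python
TRAINING_CONFIG = {
    "global_batch_size": 128,
    "epochs": 3,
    "seed": 42,
    "learning_rate": {
        "lora": 2e-4,
        "adapter": 5e-5,
        "ia3": 2e-4,
        "prefix_tuning": 1e-5,
    },
}


def per_device_batch_size(global_batch_size, num_devices, grad_accum_steps=1):
    """Split a global batch across devices and gradient-accumulation steps."""
    denom = num_devices * grad_accum_steps
    if global_batch_size % denom != 0:
        raise ValueError("global batch size must be divisible by "
                         "num_devices * grad_accum_steps")
    return global_batch_size // denom


# Placeholder hardware setups; the paper does not report its compute.
print(per_device_batch_size(TRAINING_CONFIG["global_batch_size"], num_devices=8))
print(per_device_batch_size(TRAINING_CONFIG["global_batch_size"], num_devices=4,
                            grad_accum_steps=2))
```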

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaMA-adapter: This paper benchmarks standard Adapter modules (Houlsby et al.) rather than the specific LLaMA-adapter architecture
vs. Full Fine-Tuning (FFT): This paper focuses strictly on parameter-efficient methods due to the high cost of FFT for MLLMs

Limitations

Only evaluated on 7B and 13B models; did not scale to larger models like 70B
Focused primarily on VQA and Captioning tasks; did not explore reasoning-heavy tasks extensively
Did not evaluate the 'Unfreeze Visual Encoder' setting extensively in the main results

Reproducibility

Code: https://github.com/alenai97/PEFT-MLLM.git

Code and data are publicly available, and hyperparameters for all four PEFT methods are explicitly listed.

📊 Experiments & Results

Evaluation Setup

Evaluated on 7 datasets split into 'Seen' (used in pre-training) and 'Unseen' categories to test generalization and forgetting.

Benchmarks:

ScienceQA (Visual Question Answering)
VizWiz (Visual Question Answering)
IconQA (Visual Reasoning)
Flickr30k (Image Captioning)
OKVQA (Visual Question Answering)
OCRVQA (Visual Question Answering)
VQAv2 (Visual Question Answering)

Metrics:

Accuracy (Acc)
CIDEr (for Flickr30k)
Hallucination Rate (via MMHAL-Bench)
Statistical methodology: Reported standard deviations across 3 random seeds (21, 42, 63) for stability analysis.
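
A stability summary of this kind can be computed as follows. The scores are made-up placeholders keyed by the three seeds, and the use of the sample standard deviation is an assumption, since the summary does not say which estimator the paper uses.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_seed):
    """Return (mean, sample standard deviation) over per-seed scores."""
    scores = list(scores_by_seed.values())
    return mean(scores), stdev(scores)


# Placeholder accuracies for seeds 21, 42, 63; not results from the paper.
runs = {21: 70.0, 42: 71.0, 63: 72.0}
avg, std = summarize_runs(runs)
print(f"{avg:.2f} +/- {std:.2f}")
```

A smaller standard deviation across seeds indicates a more stable method under this reading.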

Key Results

Performance on unseen datasets with LLaVA-1.5-13B shows the benefit of fine-tuning connectors.

Benchmark	Metric	Baseline	This Paper	Δ
ScienceQA (img)	Accuracy	69.05	72.48	+3.43
VizWiz	Accuracy	50.81	52.82	+2.01

Hallucination analysis on Flickr30k showing Adapter's superiority (lower is better).

Benchmark	Metric	Baseline	This Paper	Δ
Flickr30k	Hallucination Rate	54.2	11.1	-43.1
Flickr30k	Hallucination Rate	34.7	11.1	-23.6

Module location ablation study on LLaVA-1.5-7B (ScienceQA).

Benchmark	Metric	Baseline	This Paper	Δ
ScienceQA	Accuracy	65.57	69.17	+3.60
ScienceQA	Accuracy	69.25	70.15	+0.90
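
The Δ column is simply This Paper minus Baseline, and its sign has opposite meanings for the two metric types. The short check below recomputes it from the rows above; the `is_improvement` helper and its lower-is-better rule for hallucination rate are added here for illustration.

```python
# (benchmark, metric, baseline, this_paper), copied from the results table.
rows = [
    ("ScienceQA (img)", "Accuracy", 69.05, 72.48),
    ("VizWiz", "Accuracy", 50.81, 52.82),
    ("Flickr30k", "Hallucination Rate", 54.2, 11.1),
    ("Flickr30k", "Hallucination Rate", 34.7, 11.1),
    ("ScienceQA", "Accuracy", 65.57, 69.17),
    ("ScienceQA", "Accuracy", 69.25, 70.15),
]


def is_improvement(metric, delta):
    """Accuracy should go up; hallucination rate should go down."""
    lower_is_better = metric == "Hallucination Rate"
    return delta < 0 if lower_is_better else delta > 0


deltas = [round(ours - base, 2) for _, _, base, ours in rows]
improved = [is_improvement(metric, d) for (_, metric, _, _), d in zip(rows, deltas)]

for (name, metric, _, _), d in zip(rows, deltas):
    print(f"{name:16s} {metric:18s} {d:+.2f}")
```

Every row counts as an improvement once the direction of each metric is taken into account.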

Experiment Figures

Evaluation loss curves for different PEFT methods on ScienceQA (SQA) across training epochs.

Generalization performance of models fine-tuned on one source domain and tested on target domains at different overfitting points.

Main Takeaways

Adapter consistently outperforms LoRA, IA3, and Prefix-Tuning in terms of stability, generalization, and reducing hallucinations, making it the recommended PEFT method for MLLMs.
Fine-tuning connector layers significantly boosts performance on unseen datasets (~15% on average with IA3) but can degrade performance on seen datasets due to forgetting.
Data scale matters: all PEFT methods benefit from high-resource settings, but medium-resource datasets are a viable, more efficient alternative.
Parameter efficiency trade-off: Fewer trainable parameters generally help maintain performance on seen datasets (mitigating forgetting) but hinder adaptation to unseen datasets.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer architectures (Attention, FFN)
Familiarity with PEFT methods (LoRA, IA3, Adapter, Prefix-Tuning)
Basics of Multimodal LLM architecture (Visual Encoder, Connector, LLM)

Key Terms

MLLM: Multimodal Large Language Model—a system combining a visual encoder and an LLM to process image-text inputs

PEFT: Parameter-Efficient Fine-Tuning—methods to fine-tune large models by updating only a small subset of parameters

Connector Layers: Neural network layers (often MLPs) that project visual features from the encoder into the token embedding space of the LLM

LoRA: Low-Rank Adaptation—a PEFT method that injects trainable low-rank decomposition matrices into pre-trained weights

IA3: Infused Adapter by Inhibiting and Amplifying Inner Activations, a PEFT method that rescales inner activations element-wise with learned vectors

Seen vs. Unseen Datasets: Distinction used to test forgetting and generalization; 'Seen' datasets were used in the model's pre-training, while 'Unseen' datasets are strictly novel to the model

Hallucination: The generation of text that is not grounded in the provided image or contradicts facts

Adapter: A specific PEFT module inserted sequentially after attention and FFN layers (bottleneck architecture)