PEFT: Parameter-Efficient Fine-Tuning—methods to adapt large models by updating only a small set of parameters
LoRA: Low-Rank Adaptation—a PEFT method that freezes pre-trained weights and injects trainable rank decomposition matrices into each layer
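The frozen-weights-plus-low-rank-update idea behind LoRA can be sketched in a few lines. This is a minimal NumPy illustration, not a real implementation: the sizes, the scaling factor alpha/r, and the zero-initialization of B follow the common convention, but all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 8, 8, 2, 4       # toy sizes; rank r << d_in, d_out

W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero-initialized

def lora_forward(x):
    # frozen path plus low-rank update, scaled by alpha / r
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, the adapted layer starts out identical to the frozen one
assert np.allclose(lora_forward(x), W @ x)
```

Training updates only A and B (2 * r * d parameters per layer instead of d * d), and the learned update B @ A can be merged into W after training, so inference costs nothing extra.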
Adapter Tuning: Inserting small trainable neural network modules (adapters) between layers of a pre-trained model while freezing the original weights
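An adapter module is typically a small bottleneck MLP with a residual connection, inserted between frozen layers. The sketch below assumes the common zero-initialized up-projection so the adapter starts as an identity function; dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_bottleneck = 8, 2             # small bottleneck keeps the added parameter count low

W_down = rng.normal(size=(d_bottleneck, d_model)) * 0.1  # trainable down-projection
W_up = np.zeros((d_model, d_bottleneck))                 # trainable up-projection, zero-initialized

def adapter(h):
    # down-project, nonlinearity, up-project, then residual add;
    # the surrounding pre-trained layers stay frozen
    z = np.maximum(0.0, W_down @ h)      # ReLU
    return h + W_up @ z

h = rng.normal(size=d_model)
# zero-initialized up-projection makes the adapter an identity map before training
assert np.allclose(adapter(h), h)
```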
Prefix Tuning: Prepending a sequence of continuous, trainable vectors (prefixes) to the hidden states at every layer (in practice, to the attention keys and values) to steer the model's generation
Prompt Tuning: Learning soft prompt embeddings that are prepended to the input token embeddings—like discrete prompts, but optimized via gradient descent while the model stays frozen
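Mechanically, prompt tuning just prepends a trainable matrix to the frozen embedding lookup before the sequence enters the model. A minimal sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, prompt_len, seq_len = 4, 3, 5

soft_prompt = rng.normal(size=(prompt_len, d_model))  # trainable soft prompt
token_embeds = rng.normal(size=(seq_len, d_model))    # frozen embedding-lookup output

# Prepend the learned soft prompt to the token embeddings;
# gradients flow only into soft_prompt during training
model_input = np.concatenate([soft_prompt, token_embeds], axis=0)
assert model_input.shape == (prompt_len + seq_len, d_model)
```

Unlike discrete prompts, the soft prompt vectors need not correspond to any real vocabulary tokens.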
BitFit: A PEFT method that fine-tunes only the bias terms of the model
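BitFit amounts to selecting parameters by name: everything ending in "bias" is trainable, everything else is frozen. The toy parameter dictionary below is hypothetical, standing in for a model's named parameters, to show how small the trainable fraction is.

```python
# Hypothetical named parameters standing in for a real model's state dict
params = {
    "layer0.weight": [[0.1, 0.2], [0.3, 0.4]],
    "layer0.bias":   [0.0, 0.0],
    "layer1.weight": [[0.5], [0.6]],
    "layer1.bias":   [0.0],
}

# BitFit: only bias terms are trainable; all other weights stay frozen
trainable = [name for name in params if name.endswith("bias")]

def flat_size(t):
    # count the scalar entries in a (possibly nested) list
    return sum(flat_size(x) for x in t) if isinstance(t, list) else 1

n_total = sum(flat_size(v) for v in params.values())
n_train = sum(flat_size(params[k]) for k in trainable)
print(f"trainable fraction: {n_train}/{n_total}")
```

In a framework like PyTorch the same selection is usually done by toggling each parameter's gradient flag based on its name.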
Delta Tuning: A unified framework that casts fine-tuning as learning a change in parameters ($\Delta\theta$); PEFT methods differ in how they parameterize this delta efficiently
RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences
CoT: Chain-of-Thought—a prompting strategy that encourages the model to generate intermediate reasoning steps
MoE: Mixture of Experts—a model architecture in which a router activates only a few 'expert' sub-networks per input, increasing capacity without a proportional increase in compute
SFT: Supervised Fine-Tuning—training a model on labeled examples (instruction-output pairs)
Diffusion Models: Generative models that learn to reverse a noise-adding process to generate data (e.g., images) from noise
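The noise-adding (forward) process has a closed form that is easy to sketch: with a noise schedule $\beta_t$ and $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$, a noised sample is $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The linear schedule and step count below are common defaults, assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule (an assumed default)
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal-retention factor

def add_noise(x0, t):
    # closed-form forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=16)                  # stand-in for a data sample, e.g. image pixels
x_early, x_late = add_noise(x0, 10), add_noise(x0, T - 1)
# by the final step, the signal is almost entirely replaced by noise
assert alpha_bar[-1] < 1e-3
```

Training then teaches a network to reverse this process, predicting the noise (or the clean sample) at each step so that generation can start from pure noise.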