FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning

📝 Paper Summary

Personalized Federated Learning (PFL)
Vision Transformers (ViTs)

FedPerfix personalizes Vision Transformers in federated learning by targeting only the most sensitive parts of the model: the classification head is kept local, and the self-attention layers are adapted through stable, learnable prefix plugins that also stay on the client.

Core Problem

Existing partial personalization methods are designed for CNNs and don't account for the specific architecture of Vision Transformers (ViTs), where different layers exhibit varying sensitivity to data heterogeneity.

Why it matters:

One-model-fits-all FL fails on non-IID data, necessitating personalization to handle client heterogeneity.
Full model personalization is resource-intensive; partial personalization is efficient but requires knowing exactly where and how to personalize.
ViTs outperform CNNs in many tasks, but their application and personalization in federated learning remain under-explored compared to CNNs.

Concrete Example: When aggregating a ViT model across clients with label skew (e.g., different class distributions), the self-attention weights may be averaged into generic values and lose the ability to attend to client-specific features.

Key Novelty

Federated Personalized Prefix-tuning (FedPerfix)

Empirically identifies that self-attention layers and classification heads in ViTs are the most sensitive to data distribution, making them ideal targets for personalization.
Uses 'Prefix' plugins (learnable vectors prepended to the attention keys and values) as personalization modules that adapt the global model to local data without modifying shared weights.
Stabilizes prefix training using a local adapter mechanism (parallel attention) to prevent instability caused by random initialization.
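To make the prefix idea concrete, here is a minimal single-query sketch in plain Python. It only illustrates generic prefix-tuning (extra key/value vectors concatenated with the token keys and values); the function names and toy numbers are hypothetical, and it does not reproduce the stabilized FedPerfix variant.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def attend_with_prefix(query, keys, values, prefix_keys, prefix_values):
    """Same attention, with client-local prefix vectors concatenated to K and V.

    The shared keys/values are not modified; only the extra vectors differ per client.
    """
    return attend(query, prefix_keys + keys, prefix_values + values)

# Toy example (hypothetical numbers): 2 tokens, 1 prefix vector, dimension 2.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
prefix_keys = [[2.0, 2.0]]
prefix_values = [[5.0, 5.0]]
query = [1.0, 1.0]

plain = attend(query, keys, values)
personalized = attend_with_prefix(query, keys, values, prefix_keys, prefix_values)
```

In this toy case the prefix key matches the query strongly, so the personalized output is pulled toward the prefix value while the shared keys and values stay untouched.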

Architecture

Comparison of Vanilla Attention, Vanilla Prefix-tuning, and the proposed FedPerfix architecture for self-attention layers.

Evaluation Highlights

Outperforms the next best method (APFL) by 3.22 percentage points of accuracy on CIFAR-100 (non-IID).
Achieves superior performance while reducing communication costs by ~2% compared to full model aggregation baselines like FedAvg.
Demonstrates robustness across varying levels of data heterogeneity and client participation rates, maintaining a lead of 3-4 percentage points over baselines.

Breakthrough Assessment

7/10

Provides a solid empirical analysis of ViT layer sensitivity in FL and successfully adapts parameter-efficient transfer learning techniques (Prefixes) to PFL, outperforming CNN-centric baselines.

⚙️ Technical Details

Problem Definition

Setting: Classification on a decentralized dataset D partitioned across N clients, where client i holds data drawn from its own non-IID distribution P_i.

Inputs: Input images x from local datasets

Outputs: Class predictions, evaluated by top-1 classification accuracy on local test sets

Pipeline Flow

Server broadcasts global ViT parameters (u) to clients
Clients plug in local parameters (v_i): Classification Head and Prefixes (generated by local Adapters)
Clients train local parameters and shared parameters on local data
Clients send updated shared parameters (u_i) back to server; local parameters (v_i) stay on client
Server aggregates shared parameters via FedAvg
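The round structure above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: parameters are toy lists of floats, `toy_local_train` is a hypothetical stand-in for real SGD, averaging is weighted by local dataset size as in standard FedAvg, and client sampling is omitted.

```python
def fedavg(updates, weights):
    """Weighted average of per-client parameter dicts (name -> list of floats)."""
    total = float(sum(weights))
    return {
        name: [
            sum(w * u[name][i] for u, w in zip(updates, weights)) / total
            for i in range(len(updates[0][name]))
        ]
        for name in updates[0]
    }

def run_round(global_shared, clients, local_train):
    """One round: broadcast u, train locally, aggregate only the shared parameters."""
    updates, sizes = [], []
    for client in clients:
        shared = {k: list(v) for k, v in global_shared.items()}  # broadcast copy of u
        shared, client["local"] = local_train(shared, client["local"], client["data"])
        updates.append(shared)              # u_i is sent back to the server
        sizes.append(len(client["data"]))   # v_i never leaves the client
    return fedavg(updates, sizes)

def toy_local_train(shared, local, data):
    """Hypothetical 'training': move parameters toward the local data mean."""
    mean = sum(data) / len(data)
    shared = {k: [x + 0.5 * (mean - x) for x in v] for k, v in shared.items()}
    local = {k: [mean for _ in v] for k, v in local.items()}
    return shared, local

clients = [
    {"data": [1.0, 1.0], "local": {"head": [0.0]}},
    {"data": [3.0] * 6, "local": {"head": [0.0]}},
]
global_shared = {"backbone": [0.0]}
new_global = run_round(global_shared, clients, toy_local_train)
```

After the round, `new_global` contains only the shared `backbone` entry, while each client's `head` has diverged to fit its own data.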

System Modules

Global ViT Backbone

Captures general visual features common across all clients

Model or implementation: ViT-Small (patch size 16)

Local Personalization Module (FedPerfix) (Personalization)

Adapts the self-attention mechanism to local data distribution using Prefixes

Model or implementation: Parallel Adapter (Linear -> Tanh -> Linear)

Local Classification Head (Personalization)

Maps features to class predictions specific to local label distribution

Model or implementation: Linear Layer

Novel Architectural Elements

Integration of Prefix-tuning specifically into the PFL framework for ViTs
Use of a parallel adapter (scaling down -> activation -> scaling up) to generate and stabilize prefixes instead of using free parameters
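As a rough illustration of the Linear -> Tanh -> Linear bottleneck named above, the following stdlib-only sketch applies such a module to one token vector. The dimensions, initialization scale, and helper names are assumptions chosen for the example, and how the module's output is wired into the attention layer is not specified in this summary, so that step is left out.

```python
import math
import random

def linear(x, weight, bias):
    """y = W x + b for a single vector x; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def make_linear(in_dim, out_dim, rng, scale=0.02):
    """Small random weights and zero bias (an assumed initialization)."""
    weight = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def adapter(x, down, up):
    """Bottleneck module: Linear (scale down) -> Tanh -> Linear (scale up)."""
    hidden = [math.tanh(h) for h in linear(x, *down)]
    return linear(hidden, *up)

rng = random.Random(0)
embed_dim, bottleneck_dim = 8, 2   # hypothetical sizes for illustration
down = make_linear(embed_dim, bottleneck_dim, rng)
up = make_linear(bottleneck_dim, embed_dim, rng)
token = [rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
out = adapter(token, down, up)
```

Because tanh bounds the hidden activations to (-1, 1), the module's output stays small when the up-projection weights are small, which is one plausible reading of why generated prefixes train more stably than freely initialized ones.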

Modeling

Base Model: ViT-Small (ViT-S/16)

Training Method: Federated Learning with partial aggregation (SGD optimizer)

Objective Functions:

Purpose: Minimize classification error on local data.

Formally: Cross-entropy loss on local dataset D_i
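In the notation of the pipeline (shared parameters u, local parameters v_i), a generic partial-personalization objective can be written as below. The model symbol f, the loss symbol, and the uniform weighting over clients are assumptions made for illustration; the paper may weight clients differently.

```latex
\min_{u,\; v_1, \dots, v_N} \;\; \frac{1}{N} \sum_{i=1}^{N}
\mathbb{E}_{(x, y) \sim P_i} \Big[ \ell_{\mathrm{CE}}\big( f(x;\, u, v_i),\; y \big) \Big]
```

Here u is optimized jointly through aggregation, while each v_i is fit only to client i's own distribution P_i.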

Adaptation: Prefix-tuning via parallel adapters

Trainable Parameters: Global backbone (aggregated), Local Head + Adapters (kept local)

Training Data:

CIFAR-100 (64 clients, alpha=0.1)
OrganAMNIST (64 clients, alpha=0.5)
Office-Home (16 clients, alpha=1.0)
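The alpha values above are presumably the concentration parameter of a Dirichlet-based label-skew partition, a common way to simulate non-IID clients; the summary does not confirm the exact procedure, so the sketch below is an assumption. Smaller alpha concentrates each class on fewer clients. All function names are hypothetical.

```python
import random

def dirichlet(alpha, size, rng):
    """Sample a symmetric Dirichlet(alpha) vector via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:  # guard: very small alpha could underflow every draw
        return [1.0 / size] * size
    return [d / total for d in draws]

def partition_by_label(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients, splitting each class by Dirichlet shares."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        rng.shuffle(indices)
        shares = dirichlet(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for client_id, share in enumerate(shares):
            cumulative += share
            is_last = client_id == num_clients - 1
            end = len(indices) if is_last else round(cumulative * len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients

# Toy labels: 1000 samples over 10 classes, split across 8 clients.
labels = [i % 10 for i in range(1000)]
parts = partition_by_label(labels, num_clients=8, alpha=0.1)
```

Every sample lands on exactly one client; with alpha=0.1 most clients end up holding only a few classes, whereas a larger value such as 1.0 produces a milder skew.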

Key Hyperparameters:

communication_rounds: 50
local_epochs: 10
batch_size: 64
learning_rate: 0.01 (typically)
client_participation_rate: 0.125 (CIFAR/OrganAMNIST), 0.25 (Office-Home)
image_size: 224x224
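A quick calculation shows what these settings imply per round. The client counts come from the training data list above; uniform random client sampling is an assumption, and the dictionary layout is only illustrative.

```python
settings = {
    "CIFAR-100": {"clients": 64, "participation": 0.125},
    "OrganAMNIST": {"clients": 64, "participation": 0.125},
    "Office-Home": {"clients": 16, "participation": 0.25},
}
rounds, local_epochs = 50, 10

for cfg in settings.values():
    # Number of clients selected in each communication round.
    cfg["per_round"] = round(cfg["clients"] * cfg["participation"])
    # Expected rounds in which a given client is selected (uniform sampling assumed).
    cfg["expected_rounds"] = rounds * cfg["participation"]
    # Expected total local epochs a given client runs over the whole training.
    cfg["expected_local_epochs"] = cfg["expected_rounds"] * local_epochs
```

Under these assumptions, 8 clients train per round on CIFAR-100 and OrganAMNIST and 4 on Office-Home, so an individual client takes part in only a handful of the 50 rounds.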

Compute: Storage: 24.42M params (116% of base); FLOPs: 66.58M (101% of base)
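Taking the rounded 116% figure at face value, the implied base model size and the extra local parameters can be estimated as follows. Both results are approximations derived from the numbers above, not values reported separately.

```latex
\text{base} \approx \frac{24.42\,\text{M}}{1.16} \approx 21.05\,\text{M},
\qquad
\text{extra local parameters} \approx 24.42\,\text{M} - 21.05\,\text{M} \approx 3.37\,\text{M}
```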

Comparison to Prior Work

vs. FedRep: Personalizes self-attention mechanism via prefixes in addition to the head
vs. FedBN: Targets attention layers, which were found to be more sensitive in ViTs, rather than normalization layers
vs. APFL: Uses parameter-efficient plugins rather than maintaining two full models, reducing communication/storage cost

Limitations

Local storage rises to 116% of the base model (about 16% more parameters than FedAvg) because of the additional local parameters
Performance gains decrease as model size increases (ViT-Base vs ViT-Tiny)
Only evaluated on image classification tasks

Reproducibility

Code: https://github.com/imguangyu/FedPerfix

Code is publicly available on GitHub. Hyperparameters for the datasets (alpha, number of clients) and the model (ViT-S) are specified. Hardware used: 4 NVIDIA A5000 GPUs.

📊 Experiments & Results

Evaluation Setup

Simulated federated learning with non-IID data partitions

Benchmarks:

CIFAR-100 (Natural Image Classification)
OrganAMNIST (Medical Image Classification)
Office-Home (Domain Adaptation / Object Recognition)

Metrics:

Top-1 Classification Accuracy (Mean and Std Dev across clients)
Statistical methodology: Mean and standard deviation reported over all clients
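The reported statistic can be computed as in the short sketch below. The accuracy values are made up for illustration, and whether the paper uses the population or the sample standard deviation is not stated here; the population form is assumed.

```python
from statistics import mean, pstdev

def summarize(per_client_accuracy):
    """Return (mean, population std dev) of accuracy across clients."""
    values = list(per_client_accuracy.values())
    return mean(values), pstdev(values)

# Hypothetical per-client top-1 accuracies (%), not results from the paper.
accuracies = {"client_0": 40.0, "client_1": 50.0, "client_2": 60.0}
avg, spread = summarize(accuracies)
```

A low standard deviation alongside a high mean indicates that personalization helps clients evenly rather than benefiting only a few.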

Key Results

Main comparison on CIFAR-100 (high heterogeneity) shows FedPerfix outperforming all baselines.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
CIFAR-100	Top-1 Accuracy	44.88	48.10	+3.22
CIFAR-100	Top-1 Accuracy	23.29	48.10	+24.81

Results on OrganAMNIST (medical) and Office-Home (domain shift) confirm consistent improvements.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
OrganAMNIST	Top-1 Accuracy	92.63	93.17	+0.54
Office-Home	Top-1 Accuracy	24.23	24.38	+0.15

Ablation study demonstrates the effectiveness of the specialized initialization module.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
CIFAR-100	Top-1 Accuracy	46.98	48.10	+1.12

Experiment Figures

Density plot of per-client performance gain relative to Local training on CIFAR-100.

Main Takeaways

Identifying sensitive layers (Self-Attention + Head) is crucial for effective partial personalization in ViTs.
Prefix-tuning is a more effective personalization mechanism for ViTs than simply fine-tuning heads or normalization layers.
The method is robust to extreme data heterogeneity and low client participation rates.
Simply replacing CNNs with ViTs in existing PFL methods improves performance, but FedPerfix (ViT-specific design) yields the best results.

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (FedAvg algorithm)
Vision Transformer (ViT) architecture (Self-attention, patches)
Parameter-Efficient Fine-Tuning (specifically Prefix-tuning/Adapters)

Key Terms

PFL: Personalized Federated Learning—training client-specific models rather than a single global model to handle heterogeneous data

Partial model personalization: Keeping only specific parameters (e.g., heads, normalization layers) local to each client while aggregating the rest, reducing cost and preserving privacy

ViT: Vision Transformer—a neural network architecture that processes images as sequences of patches using self-attention mechanisms

Prefixes: Learnable vectors prepended to the Key and Value matrices in self-attention layers to steer the model's behavior without changing its weights

label skew: A type of non-IID data where the distribution of labels varies across clients (e.g., Client A has only cats, Client B has only dogs)

concept skew: A type of non-IID data where the same label looks different across clients (e.g., 'dog' in photos vs. 'dog' in sketches)

non-IID: Not independent and identically distributed; in federated learning, data whose distribution differs between clients

Parallel adapter: A small neural network module (down-projection, activation, up-projection) used here to generate prefixes stably

FedAvg: Federated Averaging—the standard algorithm where a server averages model weights from multiple clients

APFL: Adaptive Personalized Federated Learning—a method that mixes a global model and a local model using an adaptive coefficient