GPFL: Simultaneously Learning Global and Personalized Feature Information for Personalized Federated Learning

📝 Paper Summary

Personalized Federated Learning, Feature Representation Learning

GPFL splits the client model into dual pathways using a conditional valve to simultaneously learn global features (guided by shared category embeddings) and personalized features (driven by local tasks) without mutual interference.

Core Problem

Existing Personalized Federated Learning (pFL) methods typically focus on extracting either global or personalized features during local training, failing to achieve both collaborative learning and personalization goals effectively.

Why it matters:

Focusing only on global features (e.g., FedRoD) neglects personalized objectives, while focusing only on personalized features (e.g., FedPer, FedRep) loses global context crucial for collaboration.
Prototype-based methods (e.g., FedProto) rely on high-quality feature extractors to generate prototypes, creating a paradox where poor initial features lead to poor guidance, especially for large backbones.

Concrete Example: In FedProto, prototypes are averages of local features. If a model (like ResNet-18) is untrained, it produces poor features, leading to uninformative prototypes that mislead training. Additionally, FedProto only pulls features to prototypes without pushing them apart, causing class boundaries to intersect (as shown in the paper's t-SNE visualization on Fashion-MNIST).

Key Novelty

GPFL (Global and Personalized Federated Learning)

Introduces a Conditional Valve (CoV) that dynamically transforms a base feature vector into two distinct vectors: one for global alignment and one for personalized tasks.
Utilizes trainable Global Category Embeddings (GCE) shared across clients to guide feature extraction at both magnitude and angle levels, providing stable external information unlike dynamic prototypes.

Architecture

The internal module structure of a client in GPFL, showing the data flow through the feature extractor, Conditional Valve, and dual branches.

Evaluation Highlights

Outperforms state-of-the-art Ditto by 8.99 percentage points in accuracy on Cifar100 in the practical label skew setting (61.86 vs. 52.87).
Achieves 17.32 percentage points higher accuracy than FedProto on Tiny-ImageNet with ResNet-18 (43.70 vs. 26.38), demonstrating superior scalability to large backbones.
Leaks less private information than FedAvg and FedRoD under DLG attacks, as indicated by a lower PSNR of the reconstructed data.

Breakthrough Assessment

8/10

Outperforms SOTA methods by roughly 9 to 17 accuracy points in difficult heterogeneous settings and resolves the conflict between global and personalized objectives via a novel architectural split.

⚙️ Technical Details

Problem Definition

Setting: Personalized Federated Learning with N clients, each having a distinct distribution D_i. The goal is to minimize a weighted sum of personalized objectives F_i.

Inputs: Private data x_i with labels y_i on client i

Outputs: Personalized model parameters {W_1, ..., W_N} for each client
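
The setting above can be written compactly as follows. This is a sketch only: the summary says the objective is a weighted sum but does not state the weights, so p_i (commonly the share of samples held by client i) and the per-sample loss symbol \ell_i are assumptions introduced here for illustration.

```latex
\min_{W_1, \dots, W_N} \; \sum_{i=1}^{N} p_i \, F_i(W_i),
\qquad
F_i(W_i) = \mathbb{E}_{(x_i, y_i) \sim D_i} \big[ \ell_i(W_i; x_i, y_i) \big],
\qquad
p_i \ge 0, \;\; \sum_{i=1}^{N} p_i = 1
```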

Pipeline Flow

Local Feature Extraction (Backbone)
Conditional Transformation (CoV)
Dual-Branch Training (Global Guidance & Personalized Task)

System Modules

Feature Extractor (phi)

Maps raw input data to a lower-dimensional feature space

Model or implementation: CNN (4-layer) or ResNet-18

Conditional Valve (CoV)

Transforms base features into global-specific and personalized-specific feature vectors via affine mapping

Model or implementation: MLP with LayerNorm and ReLU
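
To make the valve idea concrete, here is a minimal pure-Python sketch of an affine, condition-driven transformation followed by LayerNorm and ReLU. It is an illustration under stated assumptions, not the authors' implementation: the FiLM-style form (1 + gamma) * f + beta, the use of single linear layers in place of MLPs, the random initialization, and the two example condition vectors are all hypothetical choices.

```python
import math
import random


def linear(x, w, b):
    """Dense layer: w is a list of rows (out_dim x in_dim), b has out_dim entries."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


class ConditionalValve:
    """Assumed FiLM-style valve: a condition vector produces a scale and a shift."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

        self.w_gamma, self.b_gamma = init(), [0.0] * dim
        self.w_beta, self.b_beta = init(), [0.0] * dim

    def __call__(self, feature, condition):
        gamma = linear(condition, self.w_gamma, self.b_gamma)  # per-dimension scale
        beta = linear(condition, self.w_beta, self.b_beta)     # per-dimension shift
        modulated = [(1.0 + g) * f + b for g, f, b in zip(gamma, feature, beta)]
        return relu(layer_norm(modulated))


valve = ConditionalValve(dim=4)
base_feature = [0.5, -1.2, 0.3, 2.0]         # output of the feature extractor phi
global_condition = [0.1, 0.2, 0.3, 0.4]      # hypothetical global conditional input
personal_condition = [0.4, 0.0, -0.3, 0.1]   # hypothetical client-specific input

f_global = valve(base_feature, global_condition)      # routed to the global branch
f_personal = valve(base_feature, personal_condition)  # routed to the personalized head
```

The same valve weights serve both routes; only the conditional input changes, which is how a single feature extractor can feed two distinct feature vectors.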

Global Category Embedding (GCE) (Dual-Branch Training)

Provides shared global class representations to guide feature learning

Model or implementation: Learnable Embedding Matrix

Personalized Head (psi) (Dual-Branch Training)

Maps personalized features to class logits for the specific client task

Model or implementation: Fully Connected Layers

Novel Architectural Elements

Conditional Valve (CoV) mechanism inserted after the backbone to create distinct 'Global' and 'Personalized' feature routes from a single feature extractor
Use of shared, trainable Global Category Embeddings (GCE) as conditional inputs to the valve and as targets for metric learning losses, replacing dynamic prototypes

Modeling

Base Model: 4-layer CNN (simple tasks), ResNet-18 (complex tasks), HAR-CNN (IoT), fastText/3-layer MLP (NLP)

Training Method: Federated Optimization with custom loss function

Objective Functions:

Purpose: Guide features to align with global class centers (angle).

Formally: L_alg = -log( exp(sim(f_i^G, GCE(y_i))) / sum_u exp(sim(f_i^G, GCE(u))) ), where the sum runs over all classes u

Purpose: Guide features to align with global class centers (magnitude).

Formally: L_mlg = ||f_i^G - stop_grad(GCE(y_i))||^2

Purpose: Optimize personalized task performance.

Formally: L_P = CrossEntropy(psi(f_i^P), y_i)

Purpose: Total local loss.

Formally: L_i = L_P + L_alg + lambda * L_mlg
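
The objectives above can be evaluated for a single sample with a few lines of pure Python. This is a forward-only sketch under assumptions: sim is taken to be cosine similarity (the summary does not define it), stop_grad has no effect without automatic differentiation so the embedding is simply treated as a constant, and the toy embeddings, logits, and lam = 0.1 are arbitrary illustrative values.

```python
import math


def cosine(a, b):
    """Cosine similarity, used here as the assumed form of sim(., .)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def log_softmax_loss(scores, label):
    """Numerically stable -log(softmax(scores)[label])."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def angle_loss(f_global, gce, label):
    """L_alg: pull toward the label's embedding, push away from the other classes."""
    sims = [cosine(f_global, embedding) for embedding in gce]
    return log_softmax_loss(sims, label)


def magnitude_loss(f_global, gce, label):
    """L_mlg: squared distance to the label's embedding (embedding held constant)."""
    return sum((x - y) ** 2 for x, y in zip(f_global, gce[label]))


def local_loss(f_global, logits, gce, label, lam):
    """L_i = L_P + L_alg + lambda * L_mlg for one sample."""
    l_p = log_softmax_loss(logits, label)  # cross-entropy on the personalized head
    return l_p + angle_loss(f_global, gce, label) + lam * magnitude_loss(f_global, gce, label)


gce = [[1.0, 0.0], [0.0, 1.0]]   # toy embedding matrix: one row per class
f_global = [1.0, 0.0]            # global-route feature from the valve
logits = [2.0, 0.0]              # output of the personalized head psi
total = local_loss(f_global, logits, gce, label=0, lam=0.1)
```

Because the angle term normalizes over every class embedding, it both attracts the feature to its own class and repels it from the others, which is the contrast with pull-only prototype matching noted in the comparison to FedProto.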

Adaptation: None (End-to-end training)

Trainable Parameters: Backbone weights, CoV weights, GCE embeddings, Personalized Head weights

Training Data:

CV: Fashion-MNIST, Cifar100, Tiny-ImageNet
NLP: AG News, Amazon Review
IoT: HAR dataset
Splits: 75% train, 25% test on each client

Key Hyperparameters:

local_learning_rate: 0.005 (CNN), 0.1 (ResNet-18/fastText), 0.01 (HAR-CNN)
batch_size: 10
local_epochs: 1
iterations: 2000
client_joining_ratio: 1.0 (default), varied for stability tests

Compute: Not reported in the paper

Comparison to Prior Work

vs. FedPer/FedRep: GPFL learns *both* global and personalized features simultaneously via the CoV split, whereas FedPer/FedRep focus only on personalized.
vs. FedRoD: GPFL allows the feature extractor to receive gradients from *both* global and personalized objectives, unlike FedRoD which isolates extractor training to the global task.
vs. FedProto: GPFL uses learnable, stable embeddings (GCE) rather than dynamic prototypes dependent on feature quality. It also explicitly pushes classes apart (contrastive) rather than just pulling them close.
vs. FedPHP: GPFL aligns features during training with learnable embeddings, whereas FedPHP relies on aligning with global features from a teacher model which may be poor initially.

Limitations

Requires tuning additional hyperparameters (lambda for magnitude guidance).
Computation overhead of the Conditional Valve and GCE lookup (though claimed to be small).
Performance gain relies on the assumption that global information is beneficial; might struggle if global and local tasks are completely contradictory (though Conditional Valve mitigates this).

Reproducibility

Code: https://github.com/TsingZ0/GPFL

📊 Experiments & Results

Evaluation Setup

Federated Learning with 20-500 clients under statistical heterogeneity.

Benchmarks:

Cifar100 (Image Classification)
Tiny-ImageNet (Image Classification)
Fashion-MNIST (Image Classification)
AG News (Text Classification)
Amazon Review (Sentiment Analysis)
HAR (Human Activity Recognition)

Metrics:

Test Accuracy
Fairness (Standard Deviation of Accuracy)
Privacy (PSNR under DLG attack)
Statistical methodology: Mean and standard deviation over 3 trials reported.
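
For readers unfamiliar with the fairness and privacy metrics, the sketch below shows how they are typically computed. It assumes flattened pixel values in [0, 1] for PSNR and uses the population standard deviation across clients for fairness; the summary does not specify either convention, and the function names are hypothetical.

```python
import math
import statistics


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat vectors.

    A lower PSNR for an attacker's reconstruction means less was recovered.
    """
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(max_value ** 2 / mse)


def accuracy_spread(client_accuracies):
    """Fairness proxy: standard deviation of test accuracy across clients."""
    return statistics.pstdev(client_accuracies)


def summarize_trials(trial_accuracies):
    """Mean and sample standard deviation over repeated trials."""
    return statistics.mean(trial_accuracies), statistics.stdev(trial_accuracies)


leak = psnr([0.0, 0.0], [0.1, 0.1])             # toy reconstruction error
spread = accuracy_spread([90.0, 92.0, 94.0])    # toy per-client accuracies
mean_acc, std_acc = summarize_trials([61.0, 62.0, 63.0])  # toy trial results
```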

Key Results

Performance on Computer Vision tasks under Practical Label Skew (Dirichlet distribution beta=0.1):

Benchmark	Metric	Baseline	This Paper	Δ
Cifar100 (Practical Label Skew)	Accuracy	52.87	61.86	+8.99
Tiny-ImageNet (Practical Label Skew)	Accuracy	37.27	43.37	+6.10
Tiny-ImageNet (ResNet-18)	Accuracy	26.38	43.70	+17.32

Performance on NLP and IoT tasks:

Benchmark	Metric	Baseline	This Paper	Δ
AG News (Practical Label Skew)	Accuracy	96.34	97.97	+1.63
HAR (Real World Setting)	Accuracy	91.57	93.76	+2.19
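
The practical label skew setting behind these results draws client data from a Dirichlet distribution with beta = 0.1. The sketch below shows one common way to build such a partition, followed by the 75%/25% per-client split listed under Training Data. It is an assumed construction (per-class shares sampled over clients), not necessarily the exact procedure used in the paper, and the function names are hypothetical.

```python
import random


def dirichlet_sample(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k outcomes."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for tiny alpha
        return [1.0 / k] * k
    return [d / total for d in draws]


def partition_by_label(labels, num_clients, beta=0.1, seed=0):
    """Assign sample indices to clients; smaller beta gives more skewed label mixes."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        shares = dirichlet_sample(beta, num_clients, rng)
        start, cumulative = 0, 0.0
        for client, share in enumerate(shares):
            cumulative += share
            # The last client takes whatever remains so every index is assigned once.
            end = len(indices) if client == num_clients - 1 else round(cumulative * len(indices))
            end = max(start, min(end, len(indices)))
            clients[client].extend(indices[start:end])
            start = end
    return clients


def train_test_split(indices, train_fraction=0.75, seed=0):
    """Shuffle one client's indices and split them into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


labels = [i % 10 for i in range(1000)]          # toy dataset with 10 balanced classes
parts = partition_by_label(labels, num_clients=20, beta=0.1)
train_idx, test_idx = train_test_split(list(range(100)))
```

With beta as small as 0.1, most of each class ends up on a handful of clients, so some clients may hold very few samples; a real pipeline would usually enforce a minimum size per client.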

Experiment Figures

t-SNE visualization of feature vectors on Fashion-MNIST for FedPer, FedProto, and GPFL.

Test accuracy and training loss curves on Amazon Review (Feature Shift setting).

Main Takeaways

Consistent superiority across CV, NLP, and IoT domains under label skew, feature shift, and real-world heterogeneity.
Scales effectively to larger backbones (ResNet-18) where methods relying on prototypes (FedProto) or feature alignment (FedPHP) degrade significantly.
Improves fairness (lower standard deviation of accuracy across clients) by sharing global information effectively.
Mitigates overfitting in feature shift settings (Amazon Review) where other pFL methods see accuracy drops after convergence.
Offers better privacy protection against gradient leakage attacks compared to standard FedAvg and FedRoD.

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (FL) basics (FedAvg)
Deep Neural Networks (feature extractors vs. heads)
Metric learning (cosine similarity, contrastive loss)
Conditional computation (dynamic routing/layers)

Key Terms

pFL: Personalized Federated Learning—a variant of FL where each client learns a personalized model instead of a single global model

label skew: A type of statistical heterogeneity where the distribution of labels varies across clients (e.g., one client has mostly cats, another mostly dogs)

feature shift: A type of statistical heterogeneity where the underlying feature distribution varies for the same labels (e.g., photos vs. sketches)

GCE: Global Category Embedding—a learnable matrix of vectors representing class centers, shared across all clients to guide feature alignment

CoV: Conditional Valve—a module that transforms a feature vector into different representations conditioned on specific inputs (global vs. personalized)

DLG: Deep Leakage from Gradients—an attack method where a server reconstructs training data from uploaded model gradients

t-SNE: t-Distributed Stochastic Neighbor Embedding—a technique for visualizing high-dimensional data by reducing it to two or three dimensions