pFedSim: Similarity-Aware Model Aggregation Towards Personalized Federated Learning

📝 Paper Summary

Personalized Federated Learning (pFL)
Non-IID Data Handling

pFedSim improves personalized Federated Learning by decoupling each model into a feature extractor and a classifier, then using distances between classifiers to identify similar clients for aggregation, without requiring clients to share additional data or statistics.

Core Problem

In Federated Learning, non-IID data distributions across clients cause single global models to perform poorly, but existing personalization methods often require exposing sensitive metadata or incur high communication costs.

Why it matters:

Data heterogeneity (non-IID) is a fundamental challenge in FL, potentially causing model divergence or severe performance drops on local data.
Existing similarity-based pFL methods often risk privacy by requiring sharing of label distributions or auxiliary data representations.
Model decoupling approaches typically train feature extractors globally without personalization, missing opportunities for finer-grained adaptation.

Concrete Example: Consider two clients: one with images of dogs and cats (labels 0-1), another with cars and trucks (labels 8-9). A standard FL model averages their weights, degrading performance for both. pFedSim detects they are dissimilar via their classifiers and aggregates their feature extractors only with other similar clients (e.g., other animal-image holders).

Key Novelty

Similarity-Aware Model Decoupling

Decouples neural networks into a 'feature extractor' and a 'classifier'; the classifier is kept local to capture personalization.
Uses the distance between local classifiers as a proxy for data similarity, enabling the server to aggregate feature extractors only from clients with similar data distributions.
Operates in two phases: a 'Generalization' warm-up phase (standard FedAvg) followed by a 'Personalization' phase where aggregation weights are adjusted based on the classifier similarity matrix.

Evaluation Highlights

Achieves the highest accuracy on CIFAR-10, CINIC-10, Tiny-ImageNet, and EMNIST when compared with 11 baselines.
Improves model accuracy by up to ~10% on Tiny-ImageNet (Dirichlet 0.1) compared to FedAvg.
Outperforms state-of-the-art pFL method FedAP by ~22% on Tiny-ImageNet (Dirichlet 0.1) without requiring public data or sharing batch-norm statistics.

Breakthrough Assessment

7/10

Strong empirical results and a privacy-friendly design for similarity estimation. While the components (decoupling, similarity aggregation) exist, the specific combination using classifier weights as a proxy is effective and practical.

⚙️ Technical Details

Problem Definition

Setting: Federated Learning with $n$ clients, each having private non-IID dataset $D_i$. Goal is to learn personalized models $\theta_1, ..., \theta_n$ to minimize local losses.

Inputs: Local private datasets $D_i$ on clients; model parameters $\theta$ exchanged with server.

Outputs: Personalized model parameters $\theta_i = \omega_i \circ \phi_i$ for each client $i$.
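
The goal stated above can be written compactly as follows. This is one common way to pose a personalized FL objective, not necessarily the paper's exact formulation; the per-client loss symbol $\mathcal{L}_i$ and the uniform $1/n$ weighting are assumptions introduced here for illustration.

```latex
\min_{\theta_1, \dots, \theta_n} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_i(\theta_i; D_i),
\qquad \theta_i = \omega_i \circ \phi_i
```

Each client $i$ is scored only on its own dataset $D_i$, which is what separates this objective from the single shared model that FedAvg optimizes.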

Pipeline Flow

Generalization Phase: Run standard FedAvg for $T_g$ rounds.
Personalization Phase: Decouple model into Feature Extractor ($\omega$) and Classifier ($\phi$).
Server: Compute similarity matrix $\Phi$ using uploaded classifiers.
Server: Aggregate feature extractors using weights from $\Phi$.
Client: Receive personalized $\omega_i$, keep local $\phi_i$, update both via local SGD.
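
The switch between the two phases can be sketched as below. This is a minimal illustration and not the authors' code: it assumes the generalization length is $T_g = \lfloor \rho T \rfloor$ for $T$ total rounds, and the function names are hypothetical.

```python
def generalization_rounds(total_rounds: int, rho: float) -> int:
    """Number of warm-up rounds T_g, ASSUMING T_g = floor(rho * T)."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return int(rho * total_rounds)


def phase_for_round(t: int, total_rounds: int, rho: float) -> str:
    """Return the phase name for the 0-indexed communication round t."""
    if t < generalization_rounds(total_rounds, rho):
        return "generalization"   # plain FedAvg over whole models
    return "personalization"      # similarity-weighted extractor aggregation


# With the hyperparameters listed in this summary (200 rounds, rho = 0.5),
# rounds 0..99 are warm-up and rounds 100..199 are personalized.
T_G = generalization_rounds(200, 0.5)
```

Keeping the phase decision in one small function makes it easy to sweep $\rho$ without touching the aggregation logic.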

System Modules

Classifier Similarity Estimator (Server Aggregation)

Calculates the similarity matrix between clients based on the cosine similarity of their classifier weights.
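
A minimal sketch of this step, assuming each client's classifier weights have already been flattened into a plain vector. The summary does not say whether the raw cosine values are post-processed before use, so none is applied here; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two flattened classifier weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(classifiers: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix Phi with Phi[i][j] = cos(classifier_i, classifier_j)."""
    n = len(classifiers)
    phi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = cosine_similarity(classifiers[i], classifiers[j])
            phi[i][j] = phi[j][i] = s
    return phi


# Toy example: clients 0 and 1 point in the same direction, client 2 is orthogonal.
PHI = similarity_matrix([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
```

In the toy example, clients 0 and 1 get similarity 1 and client 2 gets similarity 0 to both, mirroring the animals-versus-vehicles intuition from the concrete example.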

Personalized Aggregator (Server Aggregation)

Aggregates feature extractors for client $i$ using a weighted average of other clients' extractors, weighted by $\Phi_{ij}$.
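
The weighted average can be sketched as below. Several details are assumptions made for this illustration, since the summary does not specify them: feature extractors are flattened into vectors, negative similarities are clipped to zero, client $i$'s own extractor is included with weight $\Phi_{ii}$, and each row of weights is normalized to sum to 1.

```python
from typing import List, Sequence


def aggregate_for_client(
    i: int,
    extractors: Sequence[Sequence[float]],
    phi: Sequence[Sequence[float]],
) -> List[float]:
    """Similarity-weighted average of feature extractors for client i."""
    # ASSUMPTION: clip negative similarities so weights stay non-negative.
    weights = [max(0.0, w) for w in phi[i]]
    total = sum(weights)
    if total == 0.0:
        return list(extractors[i])  # no similar client: keep the local extractor
    dim = len(extractors[i])
    return [
        sum(weights[j] * extractors[j][k] for j in range(len(extractors))) / total
        for k in range(dim)
    ]


# Toy example: clients 0 and 1 are mutually similar, client 2 is isolated.
EXTRACTORS = [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]
PHI = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
AGG_0 = aggregate_for_client(0, EXTRACTORS, PHI)
AGG_2 = aggregate_for_client(2, EXTRACTORS, PHI)
```

Client 0 ends up with the mean of extractors 0 and 1, while client 2 keeps its own extractor untouched, so dissimilar clients do not pull each other's representations apart.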

Local Trainer

Updates both feature extractor and classifier on local private data.

Model or implementation: LeNet5 or MobileNetV2

Novel Architectural Elements

Dual-phase training schedule (Generalization -> Personalization) specifically designed to initialize classifiers before using them for similarity estimation
Aggregation logic that uses classifier weights (usually kept local in decoupling methods) specifically to guide the aggregation of feature extractors (usually averaged globally)

Modeling

Base Model: LeNet5 (CIFAR-10, CINIC-10, EMNIST) and MobileNetV2 (Tiny-ImageNet)

Trainable Parameters: All layers (feature extractor + classifier)

Key Hyperparameters:

learning_rate: 0.01
batch_size: 32
local_epochs: 5
communication_rounds: 200
join_ratio: 0.1
generalization_ratio_rho: 0.5
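
To make the role of join_ratio concrete, the sketch below samples the clients that participate in one round. The total client count of 100 is a hypothetical value (the summary does not state it), as are the dictionary layout and function name.

```python
import random
from typing import List

# Values copied from the hyperparameter list above.
HPARAMS = {
    "learning_rate": 0.01,
    "batch_size": 32,
    "local_epochs": 5,
    "communication_rounds": 200,
    "join_ratio": 0.1,
    "generalization_ratio_rho": 0.5,
}


def sample_active_clients(num_clients: int, join_ratio: float, rng: random.Random) -> List[int]:
    """Pick the client ids that take part in a single communication round."""
    k = max(1, round(num_clients * join_ratio))  # at least one client per round
    return sorted(rng.sample(range(num_clients), k))


# HYPOTHETICAL population of 100 clients; a join ratio of 0.1 selects 10 of them.
ACTIVE = sample_active_clients(100, HPARAMS["join_ratio"], random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.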

Compute: Intel Xeon Gold 6226R CPU, NVIDIA GeForce RTX 3090

Comparison to Prior Work

vs. FedPer/FedRep: pFedSim aggregates feature extractors based on similarity rather than global averaging, allowing personalized feature representation.
vs. FedAP: pFedSim uses classifier weights for similarity, avoiding the need to transmit batch-norm statistics or auxiliary data representations.
vs. FedFomo: pFedSim computes aggregation weights at the server, avoiding the communication cost of sending multiple models to clients.

Limitations

Relies on the assumption that classifier similarity correlates strongly with data similarity (validated empirically but may not hold for all architectures/tasks).
Requires a 'Generalization' phase whose length is set by the ratio $\rho$, a hyperparameter that affects performance.
Computing the similarity matrix at the server scales quadratically with the number of participating clients, although it only needs to be computed over the clients active in each round.
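
A quick count shows how the quadratic cost of the similarity matrix plays out. The client numbers below are hypothetical; only the pair-count formula $m(m-1)/2$ for a symmetric matrix over $m$ clients is a general fact.

```python
def pairwise_comparisons(m: int) -> int:
    """Distinct client pairs needed to fill a symmetric m x m similarity matrix."""
    return m * (m - 1) // 2


# HYPOTHETICAL sizes: 100 clients in total, 10 of them active in a round.
PAIRS_ALL = pairwise_comparisons(100)     # every client against every other
PAIRS_ACTIVE = pairwise_comparisons(10)   # only the clients sampled this round
```

Under these assumed sizes, restricting the computation to active clients cuts the work from 4950 pairs to 45 per round.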

📊 Experiments & Results

Evaluation Setup

Image classification on non-IID data partitions.
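
The non-IID partitions in this summary are parameterized by a Dirichlet concentration $\alpha$. The sketch below shows one widely used label-based scheme: for each class, draw client proportions from $\mathrm{Dir}(\alpha)$ and split that class's samples accordingly. Whether the paper uses exactly this procedure is an assumption, and all names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def sample_dirichlet(alpha: float, size: int, rng: random.Random) -> List[float]:
    """Symmetric Dirichlet draw via normalized Gamma(alpha, 1) samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for tiny alpha
        return [1.0 / size] * size
    return [d / total for d in draws]


def dirichlet_partition(
    labels: Sequence[int], num_clients: int, alpha: float, seed: int = 0
) -> List[List[int]]:
    """Assign every sample index to exactly one client, class by class."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    parts: List[List[int]] = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        props = sample_dirichlet(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += props[c]
            # The last client takes the remainder so no index is dropped.
            end = len(idxs) if c == num_clients - 1 else round(cumulative * len(idxs))
            end = max(start, min(end, len(idxs)))
            parts[c].extend(idxs[start:end])
            start = end
    return parts


# Toy data: 1000 samples over 10 balanced classes, split across 5 clients.
LABELS = [i % 10 for i in range(1000)]
PARTS = dirichlet_partition(LABELS, num_clients=5, alpha=0.1, seed=1)
```

Smaller $\alpha$ concentrates each class on fewer clients, which is why $\alpha = 0.1$ is described as highly non-IID and $\alpha = 0.5$ as moderately non-IID.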

Benchmarks:

CIFAR-10 (Image Classification)
CINIC-10 (Image Classification)
Tiny-ImageNet (Image Classification)
EMNIST (Handwritten Character Recognition)

Metrics:

Top-1 Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance on highly non-IID data (Dirichlet alpha=0.1) shows pFedSim consistently outperforming baselines.
Benchmark	Metric	Baseline	This Paper	Δ
Tiny-ImageNet	Accuracy	57.28	64.91	+7.63
EMNIST	Accuracy	95.12	95.70	+0.58
CINIC-10	Accuracy	84.30	84.34	+0.04
Performance on moderately non-IID data (Dirichlet alpha=0.5).

Experiment Figures

CKA (Centered Kernel Alignment) similarity heatmaps for different layers of LeNet5 trained on non-IID data.

Cosine similarities of classifiers trained on different CIFAR-10 subsets with varying degrees of label overlap.

Main Takeaways

pFedSim consistently outperforms standard FL (FedAvg, FedProx) and state-of-the-art pFL methods (FedRep, FedPer, FedAP) across various non-IID settings.
The method is particularly effective on complex datasets like Tiny-ImageNet, showing gains of ~7-10% over strong baselines.
Classifier-based similarity is empirically shown to be a better proxy for data similarity than batch-norm statistics (WDB) or loss-based metrics (LDB).
Robust to the choice of the generalization ratio $\rho$ within the range [0.3, 0.7], with $\rho=0.5$ generally being optimal.

📚 Prerequisite Knowledge

Prerequisites

Federated Learning basics (FedAvg)
Neural Network architecture (Feature extractor vs. Classifier)
Cosine Similarity

Key Terms

pFL: Personalized Federated Learning—techniques to adapt the global FL model to individual client data distributions.

Non-IID: Non-Independent and Identically Distributed—data on different clients follows different statistical distributions.

Model Decoupling: Splitting a neural network into a feature extractor (lower layers) and a classifier (final layers) to treat them differently during training/aggregation.

CKA: Centered Kernel Alignment—a similarity index for comparing representations of neural network layers.

Generalization Phase: An initial warm-up period running standard FedAvg to get a reasonable base model before personalization begins.

Dirichlet distribution: A probability distribution used here to partition data among clients to simulate varying degrees of non-IID heterogeneity (controlled by parameter $\alpha$).

Feature Extractor: The initial layers of a network (e.g., convolutional layers) that transform raw inputs into latent representations.

Classifier: The final layers of a network (e.g., fully connected layers) that map latent representations to output class probabilities.
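
Of the terms above, CKA is the easiest to pin down with code. The sketch below implements the linear variant, $\mathrm{CKA}(X, Y) = \|Y^\top X\|_F^2 / (\|X^\top X\|_F \, \|Y^\top Y\|_F)$ on column-centered activation matrices whose rows are examples. Whether the paper uses the linear or a kernel variant is not stated in this summary, so treat the choice as an assumption; the function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def center_columns(m: Matrix) -> List[List[float]]:
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b, summed over all feature pairs."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA between two activation matrices with the same number of rows."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator if denominator > 0.0 else 0.0


# Toy activations: 4 examples, 2 features.
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [2.0, 2.0]]
X_SCALED = [[3.0 * v for v in row] for row in X]
X_SWAPPED = [[row[1], row[0]] for row in X]
```

Linear CKA is invariant to isotropic scaling and to orthogonal transformations of the features, so both `X_SCALED` and `X_SWAPPED` score 1 against `X` up to floating-point error.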