Towards More Suitable Personalization in Federated Learning via Decentralized Partial Model Training

📝 Paper Summary

Personalized Federated Learning (PFL), Decentralized Federated Learning (DFL)

DFedAlt and DFedSalt achieve superior personalized federated learning by decoupling models into local-only and shared parts, updating them alternately, and stabilizing the shared component via sharpness-aware minimization.

Core Problem

Centralized Personalized Federated Learning suffers from single-point failure risks and communication bottlenecks, while existing decentralized methods share full models, causing 'catastrophic forgetting' of unique local features.

Why it matters:

Centralized servers are vulnerable to disruption and bandwidth constraints in real-world FL systems
Full model aggregation in decentralized settings dilutes the specific knowledge required for highly heterogeneous client data
Optimizing personalized performance in peer-to-peer networks is difficult because there is no global coordinator to align model parameters

Concrete Example: In a decentralized network where one client classifies only 2 distinct CIFAR-10 classes, standard Decentralized FedAvg forces it to average its full model with neighbors holding different classes. This dilutes its specialized weights, resulting in lower accuracy (e.g., ~54% vs 66% for partial personalization) because the unique local information is overwritten by irrelevant neighbor features.

Key Novelty

Decentralized Partial Model Training (DFedAlt) & Sharpness-Aware Decentralized Training (DFedSalt)

Decomposes the model into a 'personal' head (kept local) and a 'shared' body (gossiped with neighbors), updating them alternately to preserve local specificity while leveraging shared feature extraction
Integrates Sharpness-Aware Minimization (SAM) into the shared parameters' update (DFedSalt), adding perturbations to find flat minima that generalize better across heterogeneous clients

Architecture

Overview of DFedAlt and DFedSalt frameworks showing the alternating update and communication flow.

Evaluation Highlights

+1.99% accuracy improvement on CIFAR-10 (Dirichlet-0.3) compared to the best centralized baseline (Fed-RoD)
+7.16% accuracy improvement on CIFAR-100 (Pathological-10) compared to the best decentralized baseline (DFedSAM)
Achieves target accuracy in ~160 rounds vs. ~354 rounds for standard decentralized methods (DFedAvgM) on CIFAR-10, demonstrating significantly faster convergence

Breakthrough Assessment

8/10

Successfully marries decentralized learning with partial personalization, outperforming both centralized SOTA and full-model decentralized methods. The theoretical analysis in a non-convex setting is a strong addition.

⚙️ Technical Details

Problem Definition

Setting: Decentralized non-convex finite-sum minimization over a graph G=(N,V,W) with partial personalization

Inputs: Local dataset D_i on client i

Outputs: Personalized model w_i = (u_i, v_i) where u is shared and v is personal

Pipeline Flow

Local Update: Personal Parameters (v_i)
Local Update: Shared Parameters (u_i) via SGD or SAM
Decentralized Aggregation: Gossip shared parameters (u_i) with neighbors
Repeat for T rounds
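The four steps above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: each client fits y = u*x + v with a shared slope u and a personal offset v, gradients are full-batch rather than stochastic, and the names `grads` and `dfedalt_round`, the step sizes, and the 4-client ring matrix `W` are assumptions made for this example.

```python
def grads(u, v, data):
    """Full-batch gradients of 0.5 * mean((u*x + v - y)^2) w.r.t. u and v."""
    n = len(data)
    residuals = [(u * x + v - y, x) for x, y in data]
    grad_u = sum(r * x for r, x in residuals) / n
    grad_v = sum(r for r, _ in residuals) / n
    return grad_u, grad_v


def dfedalt_round(us, vs, datasets, W, eta_u=0.1, eta_v=0.1, k_u=5, k_v=1):
    """One round: personal update, shared update, then gossip of the shared part only."""
    m = len(us)
    new_vs, local_us = [], []
    for i in range(m):
        u, v = us[i], vs[i]
        for _ in range(k_v):  # step 1: update personal v_i, shared u_i frozen
            _, grad_v = grads(u, v, datasets[i])
            v -= eta_v * grad_v
        for _ in range(k_u):  # step 2: update shared u_i, personal v_i frozen
            grad_u, _ = grads(u, v, datasets[i])
            u -= eta_u * grad_u
        new_vs.append(v)
        local_us.append(u)
    # step 3: each client averages neighbours' shared parameters; v_i never leaves the client
    mixed_us = [sum(W[i][j] * local_us[j] for j in range(m)) for i in range(m)]
    return mixed_us, new_vs


# Toy data (hypothetical): every client shares slope 2.0 but has its own offset.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
offsets = [-1.0, 0.0, 1.0, 2.0]
datasets = [[(x, 2.0 * x + b) for x in xs] for b in offsets]
m = len(offsets)
# Ring topology: each client weights itself and its two neighbours by 1/3.
W = [[1 / 3 if (i - j) % m in (0, 1, m - 1) else 0.0 for j in range(m)] for i in range(m)]

us, vs = [0.0] * m, [0.0] * m
for _ in range(200):  # step 4: repeat for T rounds
    us, vs = dfedalt_round(us, vs, datasets, W)
```

In this toy setting the shared slope agrees across clients while each offset stays client-specific, which mirrors the shared-body and personal-head split described above.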

System Modules

Personal Parameters (v_i)

Captures client-specific patterns (typically the classification head/linear layers)

Model or implementation: ResNet-18 final linear layers

Shared Parameters (u_i)

Learns generalizable features (typically the convolutional body)

Model or implementation: ResNet-18 convolutional layers (GroupNorm)
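As a rough illustration of the head/body split, the sketch below partitions a name-to-parameter mapping by key prefix. The function name `split_parameters` and the `"fc."` prefix are assumptions for this example (the prefix follows a common ResNet naming convention); the summary does not state how the paper's code names its layers.

```python
def split_parameters(state, personal_prefixes=("fc.",)):
    """Split a {name: parameter} mapping into (shared, personal) by key prefix."""
    prefixes = tuple(personal_prefixes)  # str.startswith accepts a tuple of prefixes
    personal = {k: p for k, p in state.items() if k.startswith(prefixes)}
    shared = {k: p for k, p in state.items() if k not in personal}
    return shared, personal


# Hypothetical parameter names, with short lists standing in for tensors.
toy_state = {
    "conv1.weight": [0.1, 0.2],
    "layer1.0.conv1.weight": [0.3],
    "fc.weight": [0.4, 0.5],
    "fc.bias": [0.0],
}
shared, personal = split_parameters(toy_state)
```

Only the `shared` mapping would be sent to neighbors; `personal` stays on the client.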

Aggregator

Mixes local shared parameters with neighbors based on topology W

Model or implementation: Weighted Average
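A minimal sketch of the weighted-average aggregation, assuming a ring topology in which every client gives weight 1/3 to itself and to each of its two neighbors. The helper names `ring_gossip_matrix` and `gossip` are illustrative and do not come from the paper.

```python
def ring_gossip_matrix(m):
    """Symmetric, doubly stochastic W for a ring of m clients."""
    W = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in (i - 1, i, i + 1):
            W[i][j % m] += 1.0 / 3.0
    return W


def gossip(shared, W):
    """shared[i] is client i's flattened shared vector u_i; returns sum_j W[i][j] * u_j."""
    m, d = len(shared), len(shared[0])
    return [
        [sum(W[i][j] * shared[j][k] for j in range(m)) for k in range(d)]
        for i in range(m)
    ]


W = ring_gossip_matrix(5)
initial = [[float(i), 2.0 * i] for i in range(5)]  # toy shared vectors
mixed = gossip(initial, W)
```

Because W is doubly stochastic, one gossip step leaves the network-wide average of the shared vectors unchanged, and repeated steps drive all clients toward that average.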

Novel Architectural Elements

Decentralized alternating optimization scheme where only sub-components of the model are gossiped
Integration of local SAM optimizer specifically on the shared partition of a decentralized model

Modeling

Base Model: ResNet-18 (Batch Normalization replaced with Group Normalization)

Training Method: Decentralized Stochastic Gradient Descent with Alternating Updates (DFedAlt) or SAM (DFedSalt)

Objective Functions:

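Purpose: Minimize global loss via alternating updates.

Formally: min_{u,V} (1/m) * sum_{i=1}^{m} F_i(u, v_i), where V = (v_1, ..., v_m)

Purpose: (DFedSalt) Find flat minima for shared parameters.

Formally: u_i^{t,k+1} = u_i^{t,k} - eta_u * grad_u F_i(u_i^{t,k} + epsilon, v_i^{t+1}), where epsilon = rho * g / ||g||_2 is the standard SAM perturbation and g is the gradient with respect to u at u_i^{t,k}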
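The SAM-style update of the shared parameters can be sketched as a two-gradient step: compute the gradient, move a distance rho along its normalized direction, then descend using the gradient taken at the perturbed point. The sketch assumes the standard SAM perturbation and a toy quadratic loss; `sam_perturbation`, `sam_step`, and `toy_grad` are hypothetical names, and the defaults simply reuse the eta_u and rho values listed under Key Hyperparameters.

```python
import math


def sam_perturbation(grad, rho):
    """epsilon = rho * grad / ||grad||_2 (all zeros if the gradient vanishes)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [rho * g / norm for g in grad]


def sam_step(u, grad_fn, eta=0.1, rho=0.7):
    """One update: descend with the gradient evaluated at the perturbed point u + epsilon."""
    eps = sam_perturbation(grad_fn(u), rho)
    perturbed = [a + e for a, e in zip(u, eps)]
    grad_at_perturbed = grad_fn(perturbed)
    return [a - eta * g for a, g in zip(u, grad_at_perturbed)]


def toy_grad(u):
    """Gradient of the toy loss 0.5 * (u0^2 + 4 * u1^2); stands in for grad_u F_i."""
    return [u[0], 4.0 * u[1]]


u_next = sam_step([3.0, 1.0], toy_grad)  # approximately [2.658, 0.376]
```

In DFedSalt only the shared parameters receive this perturbed step; the personal parameters are held fixed at their freshly updated values while it runs.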

Key Hyperparameters:

learning_rate_shared (eta_u): 0.1
learning_rate_personal (eta_v): 0.001
decay_rate: 0.005
momentum: 0.9
perturbation_ratio (rho): 0.7
local_epochs_shared: 5
local_epochs_personal: 1 (Dirichlet) or 5 (Pathological)
batch_size: 128
communication_rounds: 500

Compute: Run on 100 clients (simulated). Hardware specifics not reported.

Comparison to Prior Work

vs. FedRep: Uses decentralized peer-to-peer topology instead of a central server
vs. DFedAvgM: Splits model into shared/personal parts and updates alternately
vs. DFedSAM: Applies SAM only to the shared body and keeps the head personalized/local, rather than sharing the full SAM-updated model
vs. BrainTorrent [not cited in paper]: BrainTorrent is serverless PFL but shares full models; DFedAlt shares partial models

Limitations

Evaluation limited to image classification tasks (CIFAR, Tiny-ImageNet)
Does not address privacy attacks (e.g., inversion attacks) on the shared parameters
Requires synchronized communication rounds for simulation, though applicable to asynchronous settings in theory
No statistical significance tests reported

Reproducibility

No replication artifacts are mentioned in the paper (no code URL is provided). Hyperparameters and data partition methods are described in detail.

📊 Experiments & Results

Evaluation Setup

Simulated decentralized network with 100 clients using ResNet-18

Benchmarks:

CIFAR-10 (Image Classification)
CIFAR-100 (Image Classification)
Tiny-ImageNet (Image Classification)

Metrics:

Test Accuracy (Personal Accuracy)
Communication Rounds to Target Accuracy
Statistical methodology: Mean and standard deviation reported over 3 runs. No statistical significance tests reported.

Key Results

Comparison against SOTA centralized and decentralized baselines on CIFAR-10/100 under varying heterogeneity.

Benchmark	Metric	Baseline	This Paper	Δ
CIFAR-10 (Dirichlet-0.3)	Test Accuracy (%)	85.68 (Fed-RoD)	87.67	+1.99
CIFAR-100 (Pathological-10)	Test Accuracy (%)	73.59	74.50	+0.91
CIFAR-100 (Pathological-10)	Test Accuracy (%)	67.34 (DFedSAM)	74.50	+7.16
CIFAR-10 (Pathological-2)	Test Accuracy (%)	59.76	67.03	+7.27

Efficiency analysis showing communication rounds needed to reach target accuracy.

Benchmark	Metric	Baseline	This Paper	Δ
CIFAR-10 (Dir-0.3)	Communication Rounds (to 80% acc)	354 (DFedAvgM)	131	-223
CIFAR-10 (Dir-0.3)	Communication Rounds (to 80% acc)	170	131	-39

Experiment Figures

Comparison of training progress (Accuracy vs. Rounds) between FedAvg and Decentralized FedAvg (DFedAvg) on CIFAR-10/100.

Main Takeaways

Decentralized training is more suitable for partial personalization than centralized methods, achieving higher accuracy in fewer rounds.
Applying SAM (Sharpness-Aware Minimization) to the shared parameters (DFedSalt) improves performance over DFedAlt, which updates the shared parameters with plain SGD, especially in highly heterogeneous settings.
The method is robust to sparse network topologies (Ring, Grid) compared to full-model sharing baselines.
Increasing local epochs for personal parameters helps in pathological (extreme) heterogeneity but hurts in mild heterogeneity (Dirichlet).

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (Centralized vs. Decentralized)
Partial Model Personalization (Head vs. Body splitting)
Stochastic Gradient Descent (SGD)
Sharpness-Aware Minimization (SAM)

Key Terms

PFL: Personalized Federated Learning—training distinct models for each client to handle data heterogeneity

DFL: Decentralized Federated Learning—clients communicate peer-to-peer without a central server

SAM: Sharpness-Aware Minimization—an optimization technique that minimizes both loss value and loss sharpness (geometry) to improve generalization

Partial Personalization: Splitting a neural network into shared layers (aggregated across clients) and personal layers (kept local)

Pathological distribution: Extreme non-IID data setting where clients only hold data from a very small subset of total classes (e.g., 2 out of 10)
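One common way to build such a split is sketched below: each client draws a fixed number of classes, and every class's samples are dealt round-robin to the clients that drew it. This is an assumption-laden illustration (the function name `pathological_partition`, the sampling scheme, and the toy labels are not from the paper), so the paper's exact partitioning procedure may differ.

```python
import random


def pathological_partition(labels, num_clients, classes_per_client, seed=0):
    """Return (per-client sample indices, per-client class lists)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    client_classes = [rng.sample(classes, classes_per_client) for _ in range(num_clients)]
    parts = [[] for _ in range(num_clients)]
    for c in classes:
        holders = [k for k in range(num_clients) if c in client_classes[k]]
        if not holders:
            continue  # no client drew this class; its samples go unused in this sketch
        indices = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(indices)
        for n, i in enumerate(indices):  # deal samples round-robin to the holders
            parts[holders[n % len(holders)]].append(i)
    return parts, client_classes


# Toy label vector: 10 classes with 50 samples each (hypothetical stand-in for CIFAR-10).
labels = [c for c in range(10) for _ in range(50)]
parts, client_classes = pathological_partition(labels, num_clients=8, classes_per_client=2)
```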

Gossip matrix: A matrix W defining how much weight each client gives to its neighbors' models during aggregation

Spectral gap: A measure of connectivity in the graph topology; a larger gap implies faster mixing of information
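To make the last two terms concrete, the sketch below estimates the spectral gap of a gossip matrix as 1 minus the second-largest eigenvalue magnitude, using power iteration on the subspace orthogonal to the all-ones vector. It assumes a symmetric, doubly stochastic W; conventions for the gap vary between papers, and the function names and the ring and fully connected examples are illustrative.

```python
import math


def deflate(x):
    """Remove the component along the all-ones vector (the eigenvalue-1 direction)."""
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def second_largest_eigen_magnitude(W, iters=500, tol=1e-12):
    """Power iteration restricted to vectors orthogonal to the all-ones vector."""
    m = len(W)
    x = deflate([float(i + 1) for i in range(m)])
    lam = 0.0
    for _ in range(iters):
        size = norm(x)
        if size < tol:  # nothing left outside the consensus direction
            return 0.0
        x = [xi / size for xi in x]
        x = deflate([sum(W[i][j] * x[j] for j in range(m)) for i in range(m)])
        lam = norm(x)
    return lam


def spectral_gap(W):
    return 1.0 - second_largest_eigen_magnitude(W)


m = 8
ring = [[1 / 3 if (i - j) % m in (0, 1, m - 1) else 0.0 for j in range(m)] for i in range(m)]
full = [[1.0 / m] * m for _ in range(m)]
gap_ring = spectral_gap(ring)
gap_full = spectral_gap(full)
```

The fully connected matrix reaches consensus in a single step and has the largest possible gap, while the sparse ring has a much smaller gap and therefore mixes information more slowly.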