Personalized Decentralized Federated Learning with Knowledge Distillation

📝 Paper Summary

Personalized Federated Learning, Decentralized Learning, Knowledge Distillation

KD-PDFL enables decentralized clients to personalize models by measuring peer similarity through logit-based knowledge distillation on local data, eliminating the need for central servers or public proxy datasets.

Core Problem

In decentralized federated learning, clients must identify similar peers to collaborate effectively, but measuring similarity without sharing private data or assuming a central server is difficult.

Why it matters:

Standard global models fail for clients with highly non-i.i.d. data distributions or unique preferences
Existing decentralized methods often compromise privacy by exchanging class label information or requiring public proxy datasets
Clients with small local datasets struggle to evaluate peer relevance accurately, leading to poor model convergence

Concrete Example: In an IoT device classification task with non-i.i.d. data, a user with only 15 samples cannot effectively train a local model. If they use standard FedAvg, their model gets corrupted by irrelevant updates from dissimilar peers. KD-PDFL allows them to identify and weight only relevant peers based on output similarity.

Key Novelty

Distillation-based Peer Selection (KD-PDFL)

Clients evaluate the relevance of neighbors by feeding their own local data into neighbors' received models and comparing the output logits via Wasserstein distance
This metric acts as a weight for aggregating neighbor models, allowing purely local, autonomous decisions on who to collaborate with
Eliminates the need for shared public datasets or direct label sharing found in prior personalized decentralized approaches
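
The similarity check described above can be sketched roughly as follows. This is an illustration under assumptions, not the paper's implementation: each model is assumed to be a callable that returns a list of logits for one sample, and a one-dimensional Wasserstein-1 distance is averaged over classes, since this summary does not state how the paper combines classes and samples. The names `wasserstein_1d` and `logit_distance` are hypothetical.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], List[float]]  # maps one sample to its logits


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equal-size empirical samples.

    For equal-size, uniformly weighted 1-D samples the optimal transport
    plan matches sorted values, so the distance is the mean absolute gap.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def logit_distance(local_model: Model, neighbor_model: Model,
                   local_data: Sequence[Sequence[float]]) -> float:
    """Average per-class distance between the two models' logits, both
    evaluated on the client's own private data (assumed aggregation)."""
    local_logits = [local_model(x) for x in local_data]
    neighbor_logits = [neighbor_model(x) for x in local_data]
    num_classes = len(local_logits[0])
    per_class = [
        wasserstein_1d([row[c] for row in local_logits],
                       [row[c] for row in neighbor_logits])
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes


# Toy usage: hypothetical 2-class "models" on 1-D inputs.
def local(x):
    return [x[0], -x[0]]


def similar(x):
    return [x[0] + 0.1, -x[0]]


def different(x):
    return [-x[0], x[0] + 5.0]


data = [[0.0], [1.0], [2.0]]
d_similar = logit_distance(local, similar, data)
d_different = logit_distance(local, different, data)
```

In this toy setup the neighbor whose logits track the local model's gets a much smaller distance, which is the signal a client would use to favor it during aggregation. No labels and no data leave the client; only the neighbor's model is evaluated locally.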

Architecture

The communication and update protocol for a Star Node (center) interacting with neighbors.

Evaluation Highlights

Achieves 81.6% test accuracy on the IoT Devices dataset at M=10 with small local datasets (15-100 samples per client), far above local learning (21.0%)
Reaches 71.6% test accuracy with 40 participating clients under highly non-i.i.d. settings, ahead of standard FedAvg (61.0%) and FedAvg+ (69.7%)
Converges in fewer global iterations than FedAvg+ while reaching higher accuracy

Breakthrough Assessment

6/10

Solid contribution to decentralized FL by removing the need for public proxy data using distillation. While the core mechanics are an application of known techniques (KD + FL), the fully autonomous peer-weighting protocol is valuable for privacy-sensitive edge networks.

⚙️ Technical Details

Problem Definition

Setting: Fully decentralized network with M users, no central server. Each user i minimizes local loss L_i(θ_i, X_i) plus a regularization term based on distance to neighbors.

Inputs: Local private dataset X_i, received model parameters from neighbors N_i(t)

Outputs: Personalized local model parameters θ_i, personalized collaboration graph W_i (weights assigned to neighbors)

Pipeline Flow

Neighbor Selection (Star Node)
Model Exchange
Similarity Evaluation (Distillation)
Weight Update
Model Aggregation

System Modules

Model Exchange

Receive model parameters from selected neighbors

Model or implementation: Neural Network (MLP for IoT, CNN for EMNIST)

Similarity Evaluator

Compute distance between local model outputs and neighbor model outputs using local data

Model or implementation: Local inference on neighbor models

Collaboration Graph Updater

Update weights w_ij based on calculated statistical distances

Model or implementation: Projected gradient descent on w (weights are clipped at zero)

Model Aggregator

Update local model using weighted sum of neighbor parameters

Model or implementation: Weighted Averaging
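
The aggregation step can be illustrated with a small sketch. It assumes flat parameter lists, a fixed weight of 1.0 on the client's own model, and normalization by the total weight; the summary only says "weighted sum of neighbor parameters", so the paper's exact rule may differ. The function name `aggregate` and the neighbor identifiers are hypothetical.

```python
from typing import Dict, List


def aggregate(local_params: List[float],
              neighbor_params: Dict[str, List[float]],
              weights: Dict[str, float]) -> List[float]:
    """Weighted average of the local model and the received neighbor models.

    Assumption: the local model carries weight 1.0 and the result is
    normalized by the total weight.
    """
    total = 1.0 + sum(weights[j] for j in neighbor_params)
    merged = list(local_params)
    for j, params in neighbor_params.items():
        if len(params) != len(local_params):
            raise ValueError(f"parameter size mismatch for neighbor {j}")
        for k, value in enumerate(params):
            merged[k] += weights[j] * value
    return [v / total for v in merged]


new_params = aggregate(
    local_params=[1.0, 2.0],
    neighbor_params={"a": [3.0, 4.0], "b": [100.0, -100.0]},
    weights={"a": 1.0, "b": 0.0},  # "b" was judged irrelevant
)
```

A neighbor whose weight has been driven to zero contributes nothing, so a dissimilar peer cannot pull the personalized model away from the local task.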

Novel Architectural Elements

Autonomous Collaboration Graph: Each client calculates its own incoming edge weights w_ij privately based on local distillation loss, without sharing these weights
Distillation-based Distance Metric: Using Wasserstein distance between logits generated on local private data to quantify peer similarity in a decentralized setting

Modeling

Base Model: MLP (IoT dataset) or CNN (EMNIST dataset)

Training Method: Decentralized Federated Learning with dynamic weighted aggregation

Objective Functions:

Purpose: Minimize joint loss including personalization error, dissimilarity penalty, and regularization.

Formally: J(Θ, W) = Σ_i L_i(θ_i, X_i) + (λ1/2) * Σ_{i,j} w_ij * d_W(i,j) + λ2 * g(w)

Purpose: Update collaboration weights via gradient descent.

Formally: w_ij(t+1) = max(0, w_ij(t) - η * (λ1 * d_W(i,j) + λ2 * ∇g(w)))
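
The weight update above can be written as a single projected gradient step. The sketch below is only illustrative: the regularizer g is not specified in this summary, so `grad_g_example` uses a hypothetical g(w) = 0.5 * (w - 1)^2, and the step size and coefficients are made-up values.

```python
from typing import Callable


def update_weight(w: float, distance: float, eta: float,
                  lam1: float, lam2: float,
                  grad_g: Callable[[float], float]) -> float:
    """One step of w <- max(0, w - eta * (lam1 * d_W + lam2 * g'(w)))."""
    step = lam1 * distance + lam2 * grad_g(w)
    return max(0.0, w - eta * step)


def grad_g_example(w: float) -> float:
    # Hypothetical regularizer g(w) = 0.5 * (w - 1)^2, so g'(w) = w - 1.
    # It pulls weights toward 1, i.e. toward collaborating.
    return w - 1.0


# A moderately distant neighbor keeps a reduced weight ...
w_close = update_weight(0.5, distance=2.0, eta=0.1, lam1=1.0, lam2=1.0,
                        grad_g=grad_g_example)
# ... while a very distant neighbor is clipped to zero and dropped.
w_far = update_weight(0.5, distance=10.0, eta=0.1, lam1=1.0, lam2=1.0,
                      grad_g=grad_g_example)
```

The two terms pull in opposite directions: a larger distance d_W shrinks the weight in proportion to lambda_1, while the regularizer, scaled by lambda_2, resists the collapse to an all-zero graph.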

Training Data:

IoT Devices: 15-100 samples per user, non-i.i.d. (Dirichlet 0.1)
EMNIST: Balanced split but decentralized distribution
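
For readers unfamiliar with Dirichlet-based label skew, the sketch below shows one common way to generate such a split. The paper's exact partitioning procedure is not given in this summary, so this is an assumption: each client draws class proportions from a symmetric Dirichlet(0.1) and then samples between 15 and 100 labels. Function names are hypothetical; the class count of 9 comes from the IoT benchmark description.

```python
import random
from typing import List


def dirichlet_class_proportions(num_classes: int, alpha: float,
                                rng: random.Random) -> List[float]:
    """Sample class proportions from a symmetric Dirichlet(alpha).

    Standard construction: normalize independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_client_labels(num_samples: int, proportions: List[float],
                         rng: random.Random) -> List[int]:
    """Draw class labels for one client according to its proportions."""
    classes = list(range(len(proportions)))
    return rng.choices(classes, weights=proportions, k=num_samples)


rng = random.Random(0)  # fixed seed so the toy split is repeatable
clients = []
for _ in range(3):
    props = dirichlet_class_proportions(num_classes=9, alpha=0.1, rng=rng)
    size = rng.randint(15, 100)  # small local datasets, as in the IoT setup
    clients.append((props, sample_client_labels(size, props, rng)))
```

With alpha as small as 0.1, most of each client's probability mass tends to land on one or two classes, which is what makes the split highly non-i.i.d.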

Key Hyperparameters:

exchange_interval_Tex: 20 (IoT), 5 (EMNIST)
lambda_1: Personalization term coefficient (controls sensitivity to distance)
lambda_2: Regularization coefficient (encourages collaboration)
batch_size: Not reported in the paper
learning_rate: Not reported in the paper

Comparison to Prior Work

vs. FedAvg: KD-PDFL allows personalized models and weighted aggregation based on peer similarity, whereas FedAvg forces equal weights and a single model.
vs. FedAvg+: KD-PDFL integrates personalization during the training process via the collaboration graph, rather than just fine-tuning at the end.
vs. P2P-ML [19] [not cited in paper]: P2P-ML also learns a collaboration graph but typically requires exchanging gradients or more complex information, while KD-PDFL uses model parameters and local distillation.
vs. Decentralized personalization with proxy data [15]: KD-PDFL does not require a public proxy dataset, preserving better privacy.

Limitations

Computational cost increases linearly with the number of neighbors sampled per round due to local inference on neighbor models.
Requires neighbors to share full model parameters (or at least base layers) to generate logits.
Hyperparameters lambda_1 and lambda_2 require careful tuning to balance personalization and collaboration.
Convergence speed degrades if too many neighbors are considered simultaneously (as shown in Figure 4).

Reproducibility

No code is provided. The paper describes the algorithm steps (Algorithm 1) and the datasets (publicly available IoT Devices and EMNIST). The effects of lambda_1 and lambda_2 are visualized, but the exact values used for the main results are not tabulated. Network topology parameters (Rayleigh fading) are described.

📊 Experiments & Results

Evaluation Setup

Cross-silo decentralized federated learning simulation

Benchmarks:

IoT Device Identification (classification, 9 classes)
EMNIST (image classification, 47 classes)

Metrics:

Per-client Test Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Results on the IoT Devices dataset (highly non-i.i.d., small local data) show KD-PDFL outperforming baselines, especially as the number of users (M) increases.

Benchmark	Metric	Baseline	This Paper	Δ
IoT Device Identification	Test Accuracy (M=10)	0.802	0.816	+0.014
IoT Device Identification	Test Accuracy (M=40)	0.697 (FedAvg+)	0.716	+0.019
IoT Device Identification	Test Accuracy (M=10)	0.210 (local learning)	0.816	+0.606

Results on EMNIST dataset (larger models, image data) confirm the trend.

Benchmark	Metric	Baseline	This Paper	Δ
EMNIST	Test Accuracy (M=40)	0.841	0.870	+0.029

Experiment Figures

Learning curves (Test Accuracy vs Iterations) for Local, FedAvg, FedAvg+, and KD-PDFL on IoT dataset (M=40).

Test accuracy improvement relative to the number of connecting peers per slot.

Main Takeaways

KD-PDFL consistently outperforms FedAvg and FedAvg+ across different network sizes (M=10 to 40) and datasets.
Clients with very small datasets (15-100 samples) see the largest relative benefits compared to local learning.
There is a trade-off in neighbor selection: communicating with too many peers at once (high connectivity) can slow down convergence, similar to the drift seen in global FedAvg.
The method is robust to non-i.i.d. data distributions where standard averaging fails.

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (centralized vs. decentralized)
Knowledge Distillation (logits, soft targets)
Wasserstein Distance (optimal transport metric)
Non-i.i.d. data distributions

Key Terms

FedAvg: Federated Averaging—standard FL algorithm that aggregates client updates by simple averaging

FedAvg+: Federated Averaging followed by local fine-tuning (similar to Reptile)

non-i.i.d.: Not independent and identically distributed; here, data whose class distribution varies significantly across clients

logits: Raw, unnormalized predictions generated by the last layer of a neural network before the softmax activation

collaboration graph: A weighted graph where edge weights represent the relevance or similarity between the learning tasks of connected nodes

Wasserstein distance: A distance metric between probability distributions, measuring the minimum 'cost' to transform one distribution into another

star node: A temporarily selected node that coordinates aggregation for a subset of neighbors in a decentralized round (dynamic role)

co-distillation: A collaborative learning method where peer models exchange knowledge by mimicking each other's outputs (predictions) rather than just parameters