Towards Personalized Federated Learning via Heterogeneous Model Reassembly

📝 Paper Summary

Heterogeneous Federated Learning Model Personalization

pFedHR enables personalized federated learning for clients with different model architectures by disassembling client models into layers, regrouping them by function, and stitching them into new candidate models for matching.

Core Problem

Standard Federated Learning (FL) requires identical model architectures across clients, preventing participation from heterogeneous clients (e.g., companies with proprietary models). Existing solutions rely on consensus (eroding personalization) or predefined distillation structures (limiting flexibility).

Why it matters:

Real-world FL participants often possess distinct, pre-existing models and cannot simply replace them with a global architecture
Current methods using public data for alignment suffer severe performance drops when public data distribution differs from client data
Consensus-based methods (averaging logits/representations) compromise privacy and dilute the unique characteristics necessary for personalization

Concrete Example: If Client A has a shallow CNN and Client B has a deep ResNet, standard FedAvg cannot aggregate their weights. Existing methods might force them to distill knowledge into a shared 'student' model, but if the public dataset used for distillation differs from their private data (e.g., SVHN vs. CIFAR), performance collapses (e.g., 3.5% to 12% drops in baselines).

Key Novelty

Heterogeneous Model Reassembly (pFedHR)

Treats personalization as a matching problem: instead of averaging weights, the server decomposes uploaded models into layers and reassembles them into many new candidate architectures
Uses 'function-driven layer grouping' to cluster layers from different clients based on their behavior on public data, ensuring semantically similar layers are grouped regardless of dimensions
Selects the single best-matching reassembled candidate for each client based on output similarity, then stitches the layers together using lightweight linear connectors

Architecture

Overview of the pFedHR framework, detailing the interaction between Client Update and Server Update phases.

Evaluation Highlights

Outperforms FedGH by +1.74% (83.68% vs 81.94%) on SVHN (IID setting) using labeled public data
Achieves 78.98% accuracy on SVHN with 100 clients (large-scale setting), surpassing FCCL (75.03%) and FedKEMF (76.27%)
Maintains robustness when public data differs from client data (e.g., training on SVHN using CIFAR-10 public data), showing significantly lower performance drops compared to baselines

Breakthrough Assessment

8/10

Novel application of model reassembly/stitching to the FL heterogeneity problem. Successfully moves away from consensus/distillation paradigms, addressing a critical practicality gap in FL.

⚙️ Technical Details

Problem Definition

Setting: Federated Learning with heterogeneous client models {w_1...w_B} and a server-side public dataset D_p

Inputs: B client models with different architectures, Public dataset D_p (labeled or unlabeled)

Outputs: Personalized model ŵ_n for each client n, constructed from components of other clients' models

Pipeline Flow

Clients upload heterogeneous models to Server
Server: Layer-wise Decomposition (extract layers/operations)
Server: Function-driven Layer Grouping (cluster layers using CKA on public data)
Server: Reassembly Candidate Generation (heuristic search for valid architectures)
Server: Similarity Matching (stitch candidates, fine-tune on D_p, find best match for each client)
Client: Download matched model and update via Knowledge Distillation
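
The similarity-matching step in the flow above can be sketched as follows. This is a minimal illustration, assuming the logits of the client model and of every fine-tuned candidate on the public data are already available as arrays. The function names (mean_cosine_similarity, match_candidate) and the choice to average per-sample cosine similarities are assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np

def mean_cosine_similarity(a, b, eps=1e-12):
    """Average per-sample cosine similarity of two (n_samples, n_classes) logit arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float(np.mean(num / den))

def match_candidate(client_logits, candidate_logits_list):
    """Return (index, score) of the candidate whose logits best match the client's."""
    scores = [mean_cosine_similarity(client_logits, c) for c in candidate_logits_list]
    best = int(np.argmax(scores))
    return best, scores[best]

# Toy usage: three hypothetical candidates evaluated on 8 public samples, 10 classes.
rng = np.random.default_rng(0)
client = rng.normal(size=(8, 10))
candidates = [rng.normal(size=(8, 10)), 2.0 * client, -client]
best_index, best_score = match_candidate(client, candidates)
```

Because cosine similarity ignores scale, the candidate whose logits are a positive multiple of the client's logits is selected in this toy case.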

System Modules

Layer Decomposer (Server Aggregation)

Breaks down uploaded models into individual layers and operation types

Model or implementation: Parsing algorithm

Layer Grouper (Server Aggregation)

Clusters layers by functional similarity using CKA distance on public data outputs

Model or implementation: K-means style clustering
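
A rough sketch of the similarity measure behind this grouping is shown below. It implements standard linear CKA between two layers' activations on the same batch of public inputs. Treating 1 - CKA as the distance and flattening activations to 2-D arrays are assumptions for illustration; the summary only states that a CKA-based distance is used.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activations x (n, d1) and y (n, d2) from the same n inputs."""
    # Center each feature column so the comparison ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

def cka_distance(x, y):
    """Assumed distance for clustering: 1 - CKA (0 means functionally identical)."""
    return 1.0 - linear_cka(x, y)

# Layers with different widths (5 vs 7 features) can still be compared.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(20, 5))
acts_b = rng.normal(size=(20, 7))
distance_ab = cka_distance(acts_a, acts_b)
```

Since the two feature dimensions never need to agree, the same function can compare layers taken from models of different widths, which is what makes grouping across heterogeneous architectures possible.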

Candidate Generator (Server Aggregation)

Generates M candidate models by selecting layers from different clusters based on heuristic rules (valid order, diversity)

Model or implementation: Heuristic Rule-based Search (Algorithm 1)
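
The idea of rule-based candidate generation can be illustrated with a toy enumeration. This is not the paper's Algorithm 1: the Layer record and the two rules below (no repeated depth position, layers drawn from at least two clients) are hypothetical stand-ins for the "valid order" and "diversity" heuristics mentioned above.

```python
from itertools import product
from typing import NamedTuple

class Layer(NamedTuple):
    client: int   # which client's model the layer came from
    depth: int    # position of the layer inside its original model
    cluster: int  # cluster id assigned by the grouping step

def generate_candidates(layers, max_candidates=None):
    """Enumerate candidates that take one layer per cluster under two assumed rules."""
    by_cluster = {}
    for layer in layers:
        by_cluster.setdefault(layer.cluster, []).append(layer)
    candidates = []
    for combo in product(*by_cluster.values()):
        ordered = sorted(combo, key=lambda item: item.depth)
        depths = [item.depth for item in ordered]
        # Assumed rule 1 (valid order): no two picks share an original depth position.
        if len(set(depths)) != len(depths):
            continue
        # Assumed rule 2 (diversity): picks must come from at least two clients.
        if len({item.client for item in ordered}) < 2:
            continue
        candidates.append(tuple(ordered))
        if max_candidates is not None and len(candidates) >= max_candidates:
            break
    return candidates

# Two clients, two layers each, two clusters.
layers = [Layer(0, 0, 0), Layer(0, 1, 1), Layer(1, 0, 0), Layer(1, 1, 1)]
cands = generate_candidates(layers)
```

With these toy inputs, the two single-client combinations are rejected and the two mixed-client combinations are kept. Stricter rules shrink the candidate set further and can leave it empty.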

Model Stitcher (Server Aggregation)

Connects layers of different dimensions in candidate models

Model or implementation: Linear Layer + ReLU

Client Distiller

Transfers knowledge from the downloaded reassembled model to the local model

Model or implementation: Local Client Model

Novel Architectural Elements

Dynamic decomposition and reassembly of heterogeneous client models into new candidate architectures on the server
Use of 1x1 linear stitching layers (ReLU(Wx+b)) to align dimensions between layers from different original models
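
A forward-only sketch of such a stitching layer is given below, applied to flat feature vectors. The class name StitchLayer, the initialization scheme, and the example dimensions are assumptions for illustration; in pFedHR the stitch parameters would be learned during server-side fine-tuning, which is not shown here.

```python
import numpy as np

class StitchLayer:
    """Computes ReLU(W x + b) to map d_in features onto d_out features."""

    def __init__(self, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        # He-style initialization; an assumption, not taken from the paper.
        self.weight = rng.normal(scale=np.sqrt(2.0 / d_in), size=(d_out, d_in))
        self.bias = np.zeros(d_out)

    def __call__(self, x):
        # x has shape (batch, d_in); the result has shape (batch, d_out).
        return np.maximum(x @ self.weight.T + self.bias, 0.0)

# Bridge a layer that emits 64 features to a layer that expects 128.
stitch = StitchLayer(d_in=64, d_out=128)
features = np.random.default_rng(1).normal(size=(4, 64))
aligned = stitch(features)
```

Keeping the connector this small means the stitched model's behavior is still dominated by the original client layers, which is consistent with the reported finding that heavier stitching degrades accuracy.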

Modeling

Base Model: Various CNN architectures (Simple CNNs to Complex models)

Training Method: Federated Learning with Server-side Stitching and Client-side Distillation

Objective Functions:

Purpose: Group layers by functional similarity.

Formally: min sum(delta * dis(Layer, Cluster_Center)) using CKA-based distance.

Purpose: Select best candidate model for a client.

Formally: argmax sim(client_model, candidate_model) using cosine similarity of logits on public data.

Purpose: Fine-tune stitched candidates on server.

Formally: Cross-Entropy (if labeled) or Contrastive Loss (if unlabeled).

Purpose: Update client model using knowledge from matched candidate.

Formally: CE(y, pred) + lambda * KL(teacher_logits, student_logits).
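
The client-side objective above can be written out numerically as a sketch. It assumes the KL term is taken between softmax distributions of the teacher (matched candidate) and student (local model) logits, in the direction KL(teacher || student), with no temperature scaling; these choices and the default value of lam are assumptions for illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the class axis."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def client_loss(student_logits, teacher_logits, labels, lam=1.0):
    """CE(y, student) + lam * KL(teacher || student), averaged over the batch."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    ce = -np.mean(log_p_s[np.arange(len(labels)), labels])
    kl = np.mean(np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=1))
    return float(ce + lam * kl)

# Toy batch: 6 samples, 10 classes.
rng = np.random.default_rng(0)
student = rng.normal(size=(6, 10))
teacher = rng.normal(size=(6, 10))
labels = rng.integers(0, 10, size=6)
loss = client_loss(student, teacher, labels, lam=1.0)
```

When the student already reproduces the teacher's logits, the KL term vanishes and the loss reduces to plain cross-entropy on the local labels.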

Key Hyperparameters:

local_training_epochs: 10
server_finetuning_epochs: 3
number_of_clusters_K: 4
public_data_ratio: 10% of training data
client_count_N: 12 (small setting) or 100 (large setting)
active_clients_B: 4 (small setting) or 10 (large setting)

Compute: Not reported in the paper

Comparison to Prior Work

vs. FedMD/FCCL: pFedHR avoids consensus/averaging of logits, preserving unique model traits via reassembly
vs. FedKEMF: pFedHR does not require predefined shared model structures; it dynamically generates them
vs. FedDF [not cited in paper]: FedDF relies on ensemble distillation into a global model; pFedHR constructs personalized models per client

Limitations

Relies on the existence of a public dataset (labeled or unlabeled) at the server
Process involves computational overhead for CKA calculation and candidate search, especially with many clients (mitigated by averaging models first in large-scale setting)
Trade-off between number of clusters K and number of valid candidates M (strict rules may yield empty sets if K is too large)

Reproducibility

Code: https://github.com/JackqqWang/pfedHR

The code is publicly available at the repository above. Detailed algorithms are provided, and hyperparameters for specific experiments (K, epochs) are listed.

📊 Experiments & Results

Evaluation Setup

Image classification on MNIST, SVHN, CIFAR-10. Data split 80/20 train/test. 10% of training data used as server public data.

Benchmarks:

MNIST (Image Classification)
SVHN (Image Classification)
CIFAR-10 (Image Classification)

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance under Heterogeneous Settings (Small Client Number, N=12) with Labeled Public Data.
SVHN	Accuracy (%)	81.94	83.68	+1.74
CIFAR-10	Accuracy (%)	72.69	73.88	+1.19
SVHN	Accuracy (%)	81.06	83.40	+2.34
Performance under Heterogeneous Settings with Unlabeled Public Data.
SVHN	Accuracy (%)	82.03	83.15	+1.12
CIFAR-10	Accuracy (%)	68.77	69.38	+0.61
Large Scale Client Experiment (N=100) on SVHN.
SVHN	Accuracy (%)	78.16	80.02	+1.86
SVHN	Accuracy (%)	76.27	78.98	+2.71

Experiment Figures

Bar charts comparing performance drops of different methods when using public data with different distributions (e.g., training on SVHN using CIFAR public data vs SVHN public data).

Visualization of the personalized candidate models generated for a specific client at different epochs.

Main Takeaways

Consistent State-of-the-Art: pFedHR outperforms baselines (FedMD, FedGH, FCCL, FedKEMF) across MNIST, SVHN, and CIFAR-10 in both IID and Non-IID settings.
Robustness to Public Data Distribution: When public data (e.g., CIFAR-10) differs from client data (e.g., SVHN), pFedHR suffers significantly smaller performance drops (~2-5%) compared to baselines (~10-12%).
Stitching Efficiency: Increasing the complexity of stitching layers or the number of fine-tuning epochs actually degrades performance, validating the design choice of simple linear stitches and minimal fine-tuning to preserve original model information.
Scalability: The method remains effective with 100 clients, maintaining superiority over baselines despite the increased difficulty of data fragmentation.

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (FedAvg, FL aggregation)
Neural Network Pruning/decomposition
Knowledge Distillation
Centered Kernel Alignment (CKA)

Key Terms

CKA: Centered Kernel Alignment—a similarity metric used to compare representations of neural network layers even if they have different dimensions

Model Reassembly: The process of taking layers from different trained models and combining them to form a new functional model

Stitching Layer: A lightweight layer (usually linear + activation) inserted between two disparate network layers to align their feature dimensions

Heterogeneous FL: Federated Learning where clients have different model architectures (e.g., different depths, layer types)

Knowledge Distillation: Training a 'student' model to mimic the outputs (logits) of a 'teacher' model, used here to transfer knowledge from the reassembled model to the client's local model

IID vs. Non-IID: Whether client data is independent and identically distributed (IID), meaning every client draws from the same distribution, or non-IID, meaning client distributions are skewed (e.g., each client only has specific classes)

Logits: The raw, unnormalized prediction scores output by the last layer of a neural network before the softmax activation