pFedMoE: Data-Level Personalization with Mixture of Experts for Model-Heterogeneous Personalized Federated Learning

📝 Paper Summary

Model-Heterogeneous Personalized Federated Learning (MHPFL), Mixture of Experts (MoE)

pFedMoE improves personalized federated learning by combining a shared small expert (for general features) and a private heterogeneous expert (for personal features) via a lightweight gating network, balancing knowledge at the individual data sample level.

Core Problem

Existing Model-Heterogeneous Personalized Federated Learning (MHPFL) methods struggle to balance generalized and personalized knowledge at a fine-grained data level while maintaining model privacy and low communication costs.

Why it matters:

Clients often have heterogeneous devices and models, making standard model-homogeneous FL impossible.
Data is non-IID (not independent and identically distributed) across clients, so a single global model fits poorly.
Prior methods using knowledge distillation or mutual learning incur high computational costs or fail to adapt dynamically to specific data samples.

Concrete Example: In a medical imaging scenario, one hospital uses a large ResNet while a clinic uses a small MobileNet. Standard FL fails due to architecture mismatch. Existing MHPFL methods might force them to distill knowledge via public data (privacy risk) or fix ensemble weights for all images, ignoring that some images contain unique local features needing the local expert more than the global one.

Key Novelty

Heterogeneous Local MoE with Shared Small Expert

Constructs a local Mixture of Experts (MoE) on each client comprising: (1) a private heterogeneous feature extractor (local expert), (2) a shared homogeneous small feature extractor (global expert), and (3) a gating network.
The gating network dynamically weights the contribution of local vs. global experts for *each specific data sample*, achieving data-level personalization rather than just client-level.
Only the small shared expert is transmitted to the server, preserving privacy and supporting completely different local model architectures.
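The per-sample mixing described above can be illustrated with a minimal pure-Python sketch. It assumes the gate emits two logits that are turned into weights by a softmax (the summary only says the gate outputs mixing weights), and that both experts emit feature vectors of the same dimension. The function names `softmax` and `mix_features` and the toy vectors are hypothetical, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_features(gate_logits, global_feat, local_feat):
    """Mix two same-length feature vectors with per-sample gate weights.

    Assumption: a softmax over two gate logits yields the weights for the
    shared (global) expert and the private (local) expert, in that order.
    """
    if len(global_feat) != len(local_feat):
        raise ValueError("experts must emit features of the same dimension")
    w_global, w_local = softmax(gate_logits)
    return [w_global * g + w_local * l for g, l in zip(global_feat, local_feat)]


# Two hypothetical samples on the same client: the gate leans on a
# different expert for each one, which is the data-level personalization.
sample_a = mix_features([2.0, 0.0], [1.0, 1.0], [0.0, 0.0])  # favors global
sample_b = mix_features([0.0, 2.0], [1.0, 1.0], [0.0, 0.0])  # favors local
```

A client-level scheme would apply one fixed pair of weights to every sample; here the weights are recomputed from each sample's gate output.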

Architecture

The workflow of pFedMoE, illustrating the local MoE structure and the interaction with the server.

Evaluation Highlights

Achieves up to 2.80 percentage points higher test accuracy than the state-of-the-art baseline (FedAPEN) on benchmark datasets.
Outperforms the best baseline in the same category (model mixture methods) by up to 22.16 percentage points in specific non-IID settings.
Significantly reduces communication costs compared to methods that transmit large models or generators, exchanging only small feature extractors.

Breakthrough Assessment

7/10

Strong conceptual advance in applying MoE to heterogeneous FL for fine-grained personalization. Results show solid improvements over SOTA, though the approach of sharing a small model is an evolution of existing split-learning/mutual-learning ideas.

⚙️ Technical Details

Problem Definition

Setting: Supervised classification in a Federated Learning (FL) setting with N clients, where each client k has a distinct local model architecture F_k and non-IID local data D_k.

Inputs: Local private data samples (x, y) on client devices.

Outputs: Personalized model predictions y_hat minimizing local loss.

Pipeline Flow

Local MoE Construction: Client integrates local heterogeneous expert + shared homogeneous expert + gating network
Local Training: MoE processes data; gating network weights experts; mixed features go to prediction header
Aggregation: Server aggregates only the shared homogeneous experts from clients
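The aggregation step can be sketched as a weighted average over the shared expert's parameters only. This is a FedAvg-style illustration under stated assumptions: weighting by local sample count is assumed rather than confirmed by the summary, and the function name `aggregate_shared_expert` and the parameter name `conv1.weight` are hypothetical.

```python
def aggregate_shared_expert(client_updates):
    """Sample-size-weighted average of the shared expert's parameters.

    client_updates: list of (num_samples, params), where params maps a
    parameter name to a flat list of floats. Only the shared small expert
    appears here; local experts, gates and headers stay on the client.
    """
    total = sum(n for n, _ in client_updates)
    reference = client_updates[0][1]
    averaged = {}
    for name, values in reference.items():
        averaged[name] = [
            sum(n * params[name][i] for n, params in client_updates) / total
            for i in range(len(values))
        ]
    return averaged


# Toy round with two clients holding 100 and 300 samples.
updates = [
    (100, {"conv1.weight": [0.0, 1.0]}),
    (300, {"conv1.weight": [1.0, 1.0]}),
]
new_shared = aggregate_shared_expert(updates)
```

Because every client uses the same small architecture for the shared expert, parameter names and shapes line up and can be averaged directly, regardless of how different the local experts are.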

System Modules

Local Expert (Feature Extraction)

Extract personalized features from local data using the client's private large model architecture

Model or implementation: Heterogeneous architectures (e.g., ResNet-10, ResNet-12, ShuffleNet, MobileNet-v2)

Global Expert (Feature Extraction)

Extract generalized features; facilitates cross-client knowledge transfer via aggregation

Model or implementation: Homogeneous small CNN (2 convolutional layers)

Gating Network

Determine the mixing weights for local and global expert representations based on the input sample

Model or implementation: Lightweight linear layers or small MLP

Prediction Header

Generate final classification from the weighted mixed representation

Model or implementation: MLP (classifier part of the heterogeneous local model)

Novel Architectural Elements

Hybrid MoE structure combining a private heterogeneous local expert (never transmitted) with a communicable homogeneous global expert.
Simultaneous end-to-end training of the MoE (experts + gate + header) rather than alternate or multi-stage training.

Modeling

Base Model: Various CNNs (ResNet-10, ResNet-12, ShuffleNet, MobileNet-v2) as local experts; Simple 2-layer CNN as global expert

Training Method: Federated Learning with SGD

Objective Functions:

Purpose: Minimize classification error on local data.

Formally: Standard Cross-Entropy Loss on the output of the Prediction Header.
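One way to write this objective out, as a sketch rather than the paper's exact formulation: the symbols below are assumed notation, with f^g the shared small expert, f_k the private local feature extractor of client k, g_k the gating network, and h_k the prediction header. The softmax normalization of the gate output is also an assumption.

```latex
\begin{aligned}
\big[\alpha_k^{g}(x),\ \alpha_k^{l}(x)\big] &= \operatorname{softmax}\big(g_k(x)\big), \\
\hat{y} &= h_k\big(\alpha_k^{g}(x)\, f^{g}(x) + \alpha_k^{l}(x)\, f_k(x)\big), \\
\min_{f^{g},\, f_k,\, g_k,\, h_k}\ & \mathbb{E}_{(x,y)\sim D_k}\big[\ell_{\mathrm{CE}}(\hat{y},\, y)\big].
\end{aligned}
```

All four components are updated together by the same cross-entropy gradient, which matches the simultaneous end-to-end training described under Novel Architectural Elements.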

Key Hyperparameters:

learning_rate: 0.01
batch_size: 32 or 128 (dataset dependent)
epochs: 100 or 200 (communication rounds)
local_epochs: 5

Compute: Not reported in the paper

Comparison to Prior Work

vs. FedAPEN: pFedMoE computes the expert mixing weights per sample (data-level) via a gating network, whereas FedAPEN learns static client-level weights.
vs. FedProto: pFedMoE does not require transmitting class prototypes, avoiding potential privacy leakage of class statistics.
vs. FedKD: pFedMoE trains experts simultaneously rather than alternately, reducing training complexity.
vs. PFL-MoE [not cited in paper]: PFL-MoE uses MoE for homogeneous settings; pFedMoE adapts it for heterogeneous models by decoupling the feature extractor architecture.

Limitations

Requires a shared homogeneous small expert, which adds a small computational overhead to the local client.
The shared expert must have the same output dimension as the local expert's feature extractor for mixing.
Performance depends on the ability of the small global expert to capture generalized knowledge.
No rigorous differential privacy guarantees provided, though raw data is not shared.

Reproducibility

Code availability is not explicitly provided in the paper text. Datasets used (CIFAR-10, CIFAR-100) are standard. Hyperparameters for baselines and the proposed method are detailed in the experiment section.

📊 Experiments & Results

Evaluation Setup

Simulated FL with heterogeneous client models on image classification tasks.

Benchmarks:

CIFAR-10 (Image Classification)
CIFAR-100 (Image Classification)

Metrics:

Test Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative analysis against state-of-the-art MHPFL methods on CIFAR-10 and CIFAR-100 under varying non-IID settings (Dirichlet distribution alpha). pFedMoE consistently outperforms the best baselines.
Benchmark	Metric	Baseline	This Paper	Δ
CIFAR-10 (alpha=0.1)	Test Accuracy (%)	89.04	91.84	+2.80
CIFAR-10 (alpha=0.5)	Test Accuracy (%)	90.72	92.44	+1.72
CIFAR-100 (alpha=0.1)	Test Accuracy (%)	62.47	63.79	+1.32
Comparison against same-category (Model Mixture) baselines shows a much larger gain, highlighting the limitations of simple mixing strategies.
Benchmark	Metric	Baseline	This Paper	Δ
CIFAR-10 (alpha=0.1)	Test Accuracy (%)	69.68	91.84	+22.16

Experiment Figures

Test accuracy convergence curves on CIFAR-10 with alpha=0.1.

Main Takeaways

pFedMoE consistently achieves the highest accuracy across varying degrees of data heterogeneity (alpha=0.1, 0.5) compared to diverse baselines (Knowledge Distillation, Model Mixture, Mutual Learning).
The improvement on CIFAR-10 is larger in the highly non-IID setting (alpha=0.1, +2.80 points) than at alpha=0.5 (+1.72 points), which is consistent with the 'data-level personalization' hypothesis.
The method incurs lower communication costs than generator-based methods (like FedGen) and comparable costs to other efficient MHPFL methods, as only the small expert is transmitted.
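For readers unfamiliar with the alpha values above, the following sketch shows one common way a Dirichlet parameter is used to simulate non-IID clients: each class's samples are split across clients with shares drawn from Dirichlet(alpha), so a small alpha concentrates a class on few clients. Whether the paper uses exactly this protocol is an assumption, and `dirichlet_partition` and the toy labels are hypothetical.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet shares."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Normalized Gamma(alpha, 1) draws form one Dirichlet(alpha) sample.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws) or 1.0  # guard against an all-zero underflow
        start, cumulative = 0, 0.0
        for client_id, draw in enumerate(draws):
            cumulative += draw / total
            if client_id == num_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = max(start, min(len(indices), round(cumulative * len(indices))))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients


labels = [i % 10 for i in range(1000)]  # toy stand-in for 10-class labels
parts = dirichlet_partition(labels, num_clients=5, alpha=0.1)
```

Every index is assigned to exactly one client; lowering alpha makes the per-client class mix more skewed, which is the regime where the summary reports the largest gains.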

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (FL) basics (FedAvg)
Mixture of Experts (MoE) architecture
Knowledge Distillation concepts
Non-IID data challenges

Key Terms

MHPFL: Model-Heterogeneous Personalized Federated Learning—FL where clients have different model architectures (e.g., ResNet vs. MobileNet) but collaborate to improve performance.

MoE: Mixture of Experts—A neural network architecture where different sub-models (experts) are activated by a gating network for different inputs.

non-IID: Non-Independent and Identically Distributed—Data distributions vary across clients (e.g., one client has only cats, another only dogs).

Gating Network: A small neural network that outputs probability weights to mix the outputs of different expert models.

Feature Extractor: The initial layers of a neural network that map raw input (e.g., pixels) to a latent vector representation.

FedAvg: Federated Averaging—The standard algorithm for FL where local model weights are averaged by a central server.

Knowledge Distillation: Training a student model to mimic the output (logits/features) of a teacher model.