FedGH: Heterogeneous Federated Learning with Generalized Global Header

📝 Paper Summary

Federated Learning Model Heterogeneity

FedGH enables diverse devices to collaborate in federated learning by training a shared global prediction header on class-averaged representations, decoupling it from heterogeneous local feature extractors.

Core Problem

Standard Federated Learning requires all clients to use identical model architectures (model homogeneity), excluding resource-constrained devices and failing to adapt to diverse local data distributions.

Why it matters:

Low-end edge devices cannot train the large models used by high-end servers, preventing them from participating in collaborative learning
Data on devices is Non-IID (not independently and identically distributed), meaning a single global model often performs poorly on local personalized data
Existing heterogeneous solutions often rely on public datasets (frequently unavailable in practice) or on heavy knowledge distillation that incurs high computation and communication costs

Concrete Example: In a visual classification task, a smartwatch can only run a tiny 3-layer CNN, while a powerful server runs ResNet-18. FedAvg breaks down here because weights from different architectures have different shapes and cannot be averaged element-wise. FedGH allows collaboration by sharing only the prediction header and class-averaged representations, not the full model.
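A toy illustration of the averaging problem, as a minimal sketch: `fedavg` is a hypothetical helper that averages flat weight lists uniformly (real FedAvg weights clients by data size), and the weight vectors are made-up stand-ins for a small and a large model.

```python
def fedavg(client_weights):
    """Uniform element-wise average of flat weight lists (illustrative only)."""
    lengths = {len(w) for w in client_weights}
    if len(lengths) != 1:
        # Different architectures produce differently sized weight vectors.
        raise ValueError(f"cannot average weight vectors of sizes {sorted(lengths)}")
    n = len(client_weights)
    return [sum(values) / n for values in zip(*client_weights)]


# Homogeneous clients: averaging is well defined.
same = fedavg([[1.0, 2.0], [3.0, 4.0]])

# Heterogeneous clients (e.g. a tiny CNN vs. a larger model): averaging fails.
try:
    fedavg([[1.0, 2.0], [1.0, 2.0, 3.0]])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(same)
print(mismatch_error)
```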

Key Novelty

Generalized Global Prediction Header Training via Local Averaged Representations (LARs)

Decouples the model into a personalized, heterogeneous feature extractor (kept local) and a homogeneous prediction header (shared)
Clients compute 'prototypes' of their data (average representation per class) and send these to the server instead of model weights or raw data
The server trains the shared header on these lightweight representations and broadcasts it back, replacing the client's local header to inject global knowledge
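The per-class averaging described above can be sketched in a few lines. This is a minimal illustration, assuming the extractor outputs are already available as plain lists of floats; the function name `local_averaged_representations` and the toy data are hypothetical, not taken from the FedGH code.

```python
from collections import defaultdict


def local_averaged_representations(features, labels):
    """Return {class_label: mean feature vector} for one client's data."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        # Accumulate the representation into its class sum.
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    # Divide each class sum by the number of samples in that class.
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


# Toy extractor outputs: three samples, two classes, 2-dimensional features.
features = [[1.0, 2.0], [3.0, 4.0], [0.0, 10.0]]
labels = [0, 0, 1]
lars = local_averaged_representations(features, labels)
print(lars)  # {0: [2.0, 3.0], 1: [0.0, 10.0]}
```

Each client uploads one vector per class it holds, so the upload size depends on the number of local classes and the representation dimension rather than on the size of the extractor.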

Architecture

The FedGH workflow, illustrating the separation of local heterogeneous extractors and the shared global header.

Evaluation Highlights

Outperforms state-of-the-art FedProto by +1.33% accuracy on CIFAR-100 in model-heterogeneous settings
Achieves significantly higher accuracy in homogeneous settings (+11.11% vs FedProto on CIFAR-100, N=10)
Reduces communication overhead by 85.53% compared to the best-performing baseline (FedProto) on CIFAR-100 while achieving higher accuracy

Breakthrough Assessment

8/10

Offers a simple yet highly effective solution to model heterogeneity that beats complex distillation methods in accuracy and efficiency without needing public data.

⚙️ Technical Details

Problem Definition

Setting: Federated Learning with N clients having non-IID data distributions and heterogeneous model architectures.

Inputs: Local private datasets D_k; Heterogeneous local feature extractors F_k.

Outputs: Trained personalized local models w_k consisting of local extractors and a shared global header.

Pipeline Flow

Local Training: Client trains heterogeneous model on private data
Extraction: Client computes Local Averaged Representations (LARs) per class
Upload: Client sends LARs and labels to Server
Global Training: Server updates Global Header using LARs
Broadcast: Server sends Global Header to Clients
Replacement: Client replaces local header with Global Header
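The server-side half of this pipeline (upload, global training, broadcast, replacement) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a single linear header trained with plain SGD, and the uploads, client records, learning rate, and epoch count are all made-up toy values.

```python
import copy
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def header_logits(header, rep):
    """Linear header: one weight row and one bias per class."""
    return [sum(w * r for w, r in zip(row, rep)) + b
            for row, b in zip(header["W"], header["b"])]


def train_header_on_lars(header, uploads, lr, epochs):
    """SGD on cross-entropy between header predictions on LARs and labels."""
    for _ in range(epochs):
        for rep, label in uploads:
            probs = softmax(header_logits(header, rep))
            for c, row in enumerate(header["W"]):
                grad = probs[c] - (1.0 if c == label else 0.0)  # dL/dlogit_c
                for j, r in enumerate(rep):
                    row[j] -= lr * grad * r
                header["b"][c] -= lr * grad
    return header


def predict(header, rep):
    scores = header_logits(header, rep)
    return max(range(len(scores)), key=scores.__getitem__)


# Upload: toy (LAR, class label) pairs from two hypothetical clients.
uploads = [
    ([2.0, 0.1], 0), ([0.2, 1.9], 1),  # client A
    ([1.8, 0.0], 0), ([0.1, 2.2], 1),  # client B
]

# Global training: the server fits the shared header on the uploaded LARs.
global_header = {"W": [[0.0, 0.0], [0.0, 0.0]], "b": [0.0, 0.0]}
train_header_on_lars(global_header, uploads, lr=0.1, epochs=50)

# Broadcast and replacement: each client overwrites its local header.
clients = [
    {"extractor": "CNN-1", "header": {"W": [[1.0, 1.0], [1.0, 1.0]], "b": [0.0, 0.0]}},
    {"extractor": "CNN-5", "header": {"W": [[-1.0, 0.0], [0.0, -1.0]], "b": [0.0, 0.0]}},
]
for client in clients:
    client["header"] = copy.deepcopy(global_header)

predictions = [predict(clients[0]["header"], rep) for rep, _ in uploads]
print(predictions)
```

The extractors never leave the clients in this sketch; only the small header and the per-class vectors cross the network, which is why the extractor architectures are free to differ.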

System Modules

Heterogeneous Feature Extractor

Maps raw input data to feature representations; architecture varies across clients (e.g., CNN-1 vs CNN-5)

Model or implementation: Various CNN architectures (CNN-1 to CNN-5) or ResNet-18

Global Prediction Header

Classifies representations; aggregates global knowledge across all clients

Model or implementation: Homogeneous Fully Connected Layer

Novel Architectural Elements

Replacement-based aggregation: Instead of averaging weights, the server trains a header module on representation proxies (LARs) and completely replaces the client's local header module each round

Modeling

Base Model: Heterogeneous CNNs (5 variants, 2.55MB to 10.00MB) or ResNet-18

Training Method: Federated Learning with Split Training mechanism

Objective Functions:

Purpose: Optimize local heterogeneous models on private data.

Formally: Minimize Cross-Entropy Loss on local dataset D_k.

Purpose: Optimize global homogeneous header on server.

Formally: Minimize Cross-Entropy Loss between header predictions on LARs (R_{k,s}) and true class labels s.
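
Written out, the two objectives might look as follows. This is a sketch in assumed notation: only D_k, F_k, and R_{k,s} come from the summary, while the header function h, the local and global header parameters \theta_k and \theta, the cross-entropy loss \ell, the class-s subset D_{k,s} of D_k, and the set S_k of classes present on client k are introduced here for illustration. Whether the server objective is normalized or weighted per client is not stated in the summary.

```latex
\begin{align}
\min_{F_k,\,\theta_k}\; & \frac{1}{|D_k|} \sum_{(x,y)\in D_k}
    \ell\big(h(F_k(x);\theta_k),\, y\big)
    && \text{(local training on client } k\text{)} \\
R_{k,s} \;=\; & \frac{1}{|D_{k,s}|} \sum_{x \in D_{k,s}} F_k(x)
    && \text{(LAR of class } s \text{ on client } k\text{)} \\
\min_{\theta}\; & \sum_{k=1}^{N} \sum_{s \in S_k}
    \ell\big(h(R_{k,s};\theta),\, s\big)
    && \text{(global header training on the server)}
\end{align}
```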

Key Hyperparameters:

learning_rate_local: 0.01
learning_rate_global_header: 0.01
batch_size: Grid search {32, 64, 128, 256, 512}
local_epochs: Grid search {1, 10, 30, 50, 100}
communication_rounds: 100 or 500
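
The search space above can be enumerated as a simple grid. This is a minimal sketch: the `grid` helper and the dictionary layout are hypothetical, and `communication_rounds` is left out because the summary does not say which setting uses 100 and which uses 500.

```python
from itertools import product

# Values listed in the hyperparameter summary.
search_space = {
    "batch_size": [32, 64, 128, 256, 512],
    "local_epochs": [1, 10, 30, 50, 100],
}
fixed = {"learning_rate_local": 0.01, "learning_rate_global_header": 0.01}


def grid(space, fixed_values):
    """Yield one config dict per combination of the searched values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield {**fixed_values, **dict(zip(keys, values))}


configs = list(grid(search_space, fixed))
print(len(configs))  # 25
print(configs[0])
```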

Compute: Server trains only a lightweight header; Clients compute forward pass for LAR extraction (low cost). Implemented on NVIDIA GeForce RTX 3090.

Comparison to Prior Work

vs. FedProto: FedGH explicitly trains and shares a parameter-based header rather than just aligning representations via loss; FedGH achieves better accuracy with lower communication.
vs. LG-FedAvg: FedGH trains the header on representations to handle Non-IID data, whereas LG-FedAvg simply averages weights, which fails under statistical heterogeneity.
vs. FedMD: FedGH does not require any public dataset.
vs. FedGen [not cited in paper]: FedGen learns a generator to create samples; FedGH uses actual aggregated representation statistics (LARs) directly, avoiding generator training complexity.

Limitations

Relies on the assumption that representations from different heterogeneous extractors are compatible for a single header (though empirical results suggest they are).
Requires clients to upload class-averaged representations. This exposes less than sharing raw data but potentially more than sharing pure gradient updates, although the exposure is arguably similar to that of FedProto.
Performance gains diminish as data becomes more IID (Independent and Identically Distributed).

Reproducibility

Code: https://github.com/LipingYi/FedGH

The datasets (CIFAR-10/100) are public, and detailed model structures for the heterogeneous CNNs are provided in Table 2 of the paper.

📊 Experiments & Results

Evaluation Setup

Image classification on Non-IID partitioned data across clients with heterogeneous models.
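
One common way to build such a partition is to give each client only a subset of the classes (for instance 2 of 10). The sketch below assumes that scheme; the function `assign_classes`, the random sampling, and the seed are illustrative choices, and the exact partition protocol used in the paper is not described in this summary.

```python
import random


def assign_classes(num_clients, num_classes, classes_per_client, seed=0):
    """Pick a fixed-size random subset of class labels for every client."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(num_classes), classes_per_client))
            for _ in range(num_clients)]


# Example: 10 clients, CIFAR-10-style labels, 2 classes per client.
assignment = assign_classes(num_clients=10, num_classes=10, classes_per_client=2)
for client_id, classes in enumerate(assignment):
    print(client_id, classes)
```

Fewer classes per client means a stronger Non-IID skew; setting `classes_per_client` equal to `num_classes` recovers a label-balanced split.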

Benchmarks:

CIFAR-10 (image classification, 10 classes)
CIFAR-100 (image classification, 100 classes)

Metrics:

Average Test Accuracy (%)
Communication Overhead (MB/KB)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Model-Homogeneous Results: Even when models are identical (ResNet-18), FedGH's header training mechanism outperforms standard aggregation.
CIFAR-100 (N=10, C=100%)	Average Test Accuracy (%)	64.63	73.62	+8.99
CIFAR-10 (N=10, C=100%)	Average Test Accuracy (%)	94.34	96.33	+1.99
Model-Heterogeneous Results: Clients use 5 different CNN architectures. FedGH achieves best accuracy and efficiency.
CIFAR-100 (Non-IID: 10/100)	Average Test Accuracy (%)	72.80	74.13	+1.33
CIFAR-100 (Heterogeneous)	Communication Overhead (MB)	101.47	14.69	-86.78
CIFAR-10 (Non-IID: 2/10)	Average Test Accuracy (%)	96.47	97.60	+1.13
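
The Δ column is an absolute difference, so the communication row is in MB rather than percent. A short check, using only the numbers in the table (the row labels are paraphrased), recomputes the deltas and converts the overhead difference into a relative saving:

```python
# (label, baseline, this paper) taken from the results table.
rows = [
    ("CIFAR-100 homogeneous accuracy (%)", 64.63, 73.62),
    ("CIFAR-10 homogeneous accuracy (%)", 94.34, 96.33),
    ("CIFAR-100 heterogeneous accuracy (%)", 72.80, 74.13),
    ("CIFAR-10 heterogeneous accuracy (%)", 96.47, 97.60),
    ("CIFAR-100 communication overhead (MB)", 101.47, 14.69),
]

# Absolute differences, rounded to the table's two decimals.
deltas = {name: round(ours - base, 2) for name, base, ours in rows}

# Relative communication saving implied by the overhead row.
base_mb, ours_mb = 101.47, 14.69
relative_saving = 100.0 * (base_mb - ours_mb) / base_mb

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print(f"relative communication saving: {relative_saving:.1f}%")
```

The relative saving comes out at roughly 85.5%, in line with the 85.53% quoted in the evaluation highlights; the small gap presumably stems from rounding in the tabulated values.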

Experiment Figures

Test accuracy vs. Communication Rounds for CIFAR-10 and CIFAR-100.

Impact of Non-IID degree (number of classes per client) on accuracy.

Main Takeaways

Consistent superiority in both homogeneous and heterogeneous settings, beating baselines like FedProto and LG-FedAvg.
Extreme communication efficiency: on CIFAR-100, FedGH reduces overhead by ~85% compared to FedProto because it only transmits lightweight headers and representations, not full models or dense gradients.
Robust to Non-IID data: performance gap widens on the harder CIFAR-100 dataset compared to CIFAR-10.
Insensitive to global header learning rate (eta_theta), making it easy to tune.

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (FedAvg)
Convolutional Neural Networks (CNNs)
Knowledge Distillation concepts

Key Terms

FL: Federated Learning—a decentralized ML approach where devices train locally and share updates without exposing raw data

Non-IID: Non-Independent and Identically Distributed—data distribution varies across clients (e.g., one user has only photos of cats, another only dogs)

LAR: Local Averaged Representation—the mean feature vector of all samples belonging to a specific class on a client device

Feature Extractor: The part of the neural network that maps raw inputs (images) to latent vector representations

Prediction Header: The final layers of the neural network (usually fully connected) that map representations to class probabilities

Model Heterogeneity: A scenario in FL where clients use different model architectures (e.g., different sizes/depths) suited to their hardware constraints

Logits: The raw, unnormalized prediction scores output by the final layer of a neural network before applying softmax