FedSoup: Improving Generalization and Personalization in Federated Learning via Selective Model Interpolation

📝 Paper Summary

Personalized Federated Learning (PFL)
Out-of-Distribution Generalization

FedSoup improves both local personalization and global generalization in federated learning by selectively averaging historical global models into a client-specific 'soup' and interpolating this with local models to find flat minima.

Core Problem

Current Federated Learning algorithms face a severe trade-off between local performance (personalization) and global performance (generalization) when handling heterogeneous data distributions.

Why it matters:

Personalized FL methods (like FedRep) often overfit local data, leading to sharp minima that fail to generalize to out-of-distribution (OOD) data.
Medical imaging scenarios suffer from significant distribution shifts (e.g., different scanners/hospitals), requiring models that work well locally but also robustly across institutions.

Concrete Example: In a cross-silo setting with multiple hospitals, a model trained on Hospital A's data (personalization) might perform poorly on the joint distribution of all hospitals (generalization) because it settles into a sharp, narrow valley in the loss landscape specific to Hospital A.

Key Novelty

Federated Model Soups (FedSoup)

Adapt 'Model Soups' to FL by using historical global models from different training rounds as ingredients, rather than training many models from scratch.
Each client maintains a personalized 'soup' by greedily selecting global models based on local validation performance.
Interpolate (patch) the local model with the client-specific global soup during training to encourage the model towards flat minima, bridging the local-global gap.

Architecture

Overview of the FedSoup method compared to common PFL methods. It illustrates the 'sharp valley' vs 'flat minima' concept.

Evaluation Highlights

+2.87 AUC improvement on unseen domain generalization for pathology image classification compared to FedAvg.
On the Retinal Fundus datasets, reaches 90.92% global accuracy and 96.00 global AUC (+0.67 and +0.51 over the baseline) while keeping local performance competitive (local AUC 86.24).
Reduces the sharpness of the loss landscape (measured by Hessian eigenvalues) compared to FedAvg and FedProx, confirming the method finds flatter minima.

Breakthrough Assessment

7/10

Offers a practical, compute-efficient solution to the personalization-generalization trade-off in FL using model interpolation. Strong empirical results on medical data, though the core 'soup' concept is adapted from centralized learning.

⚙️ Technical Details

Problem Definition

Setting: Cross-silo Federated Learning with N clients, each having a local distribution D_i. The goal is to minimize local empirical risk (personalization) while also minimizing risk on the joint global distribution D and unseen domains T.

Inputs: Local private datasets (images) distributed across N clients; global model parameters received from a central server.

Outputs: A personalized model f(·; θ_i) for each client that performs well on local test data and generalizes to the global distribution.
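
The three targets named in the setting can be written as risks of the same personalized model under different distributions. This is a notational sketch only: the symbols \mathcal{R} and \ell (a per-sample loss) are assumptions introduced here, not notation taken from the paper.

```latex
\begin{align}
\mathcal{R}_{D_i}(\theta_i) &= \mathbb{E}_{(x,y)\sim D_i}\big[\ell(f(x;\theta_i),\, y)\big] && \text{(local risk, personalization)} \\
\mathcal{R}_{D}(\theta_i)   &= \mathbb{E}_{(x,y)\sim D}\big[\ell(f(x;\theta_i),\, y)\big]   && \text{(global risk, generalization)} \\
\mathcal{R}_{T}(\theta_i)   &= \mathbb{E}_{(x,y)\sim T}\big[\ell(f(x;\theta_i),\, y)\big]   && \text{(unseen-domain risk)}
\end{align}
```

Personalization targets the first line; the trade-off described above arises when lowering it raises the second and third.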

Pipeline Flow

Server aggregates local updates → Global Model Broadcast
Client receives Global Model
Temporal Model Selection (Greedy check against local validation set)
Federated Model Patching (Interpolate Local and Global Soup)
Local Training (gradient-based updates; the listed optimizer is Adam)
Upload to Server

System Modules

Global Model Aggregator

Aggregates local model updates from clients using standard FedAvg

Model or implementation: Same architecture as clients (ResNet-18)
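
As a rough illustration of the aggregation step, the sketch below computes a sample-size-weighted average of client parameters, which is the usual form of FedAvg. The flat-list weight representation and the name `fedavg` are assumptions made for the example; a real implementation would operate on framework tensors.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values (illustrative)


def fedavg(client_weights: List[Weights], num_samples: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(num_samples)
    aggregated: Weights = {}
    for name in client_weights[0]:
        size = len(client_weights[0][name])
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, num_samples)) / total
            for i in range(size)
        ]
    return aggregated


# Two toy clients with 100 and 300 samples.
clients = [{"fc": [1.0, 2.0]}, {"fc": [3.0, 6.0]}]
global_weights = fedavg(clients, num_samples=[100, 300])
print(global_weights)  # {'fc': [2.5, 5.0]}
```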

Temporal Model Selection (Client-side)

Decides whether to add the current global model to the client's local 'soup' history based on validation accuracy

Model or implementation: Evaluation logic
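
The selection rule can be sketched as a greedy check: a new global model joins the soup only if the averaged result does not score worse on the client's validation data. The function name `try_add_to_soup`, the flat-list weights, and the toy `val_score` are hypothetical; the acceptance rule (no drop in validation score) follows the greedy-soup idea and may differ in detail from the authors' code.

```python
from typing import Callable, List, Tuple


def try_add_to_soup(
    soup: List[float],
    n_ingredients: int,
    candidate: List[float],
    val_score: Callable[[List[float]], float],
) -> Tuple[List[float], int, bool]:
    """Greedy step: keep the candidate only if the averaged soup does not score worse."""
    if n_ingredients == 0:
        return list(candidate), 1, True
    # Running mean, so earlier ingredients need not be stored individually.
    trial = [
        (s * n_ingredients + c) / (n_ingredients + 1)
        for s, c in zip(soup, candidate)
    ]
    if val_score(trial) >= val_score(soup):
        return trial, n_ingredients + 1, True
    return soup, n_ingredients, False


# Toy validation score: higher is better, best at weights == target.
target = [1.0, 1.0]


def val_score(weights: List[float]) -> float:
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


soup: List[float] = []
n = 0
decisions = []
for global_model in ([0.0, 0.0], [2.0, 2.0], [-4.0, -4.0]):
    soup, n, accepted = try_add_to_soup(soup, n, global_model, val_score)
    decisions.append(accepted)

print(soup, n, decisions)  # [1.0, 1.0] 2 [True, True, False]
```

Keeping a running mean plus an ingredient count is one way to avoid storing every accepted checkpoint on the client.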

Federated Model Patching (Client-side)

Interpolates the current local model with the averaged soup to serve as the starting point for local training

Model or implementation: Weight Averaging
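
Patching amounts to a convex combination of two weight vectors. In the minimal sketch below, the coefficient `alpha` and its value are assumptions chosen for illustration; the summary does not state how the interpolation weight is set.

```python
from typing import List


def patch(local: List[float], soup: List[float], alpha: float = 0.5) -> List[float]:
    """Return (1 - alpha) * local + alpha * soup, element by element."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(local) != len(soup):
        raise ValueError("local and soup must have the same number of weights")
    return [(1.0 - alpha) * l + alpha * s for l, s in zip(local, soup)]


local_weights = [0.0, 4.0]
soup_weights = [2.0, 0.0]
start_point = patch(local_weights, soup_weights, alpha=0.25)
print(start_point)  # [0.5, 3.0]
```

With `alpha=0` the client keeps its purely local weights, and with `alpha=1` it restarts from the soup; values in between trade local fit against proximity to the global models.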

Novel Architectural Elements

Client-specific global model pool maintenance (each client keeps a unique subset of historical global models).
Integration of model soup selection logic directly into the FL training loop (interleaved with communication rounds).

Modeling

Base Model: ResNet-18

Training Method: Federated Learning with selective weight interpolation (FedSoup)

Objective Functions:

Purpose: Minimize classification error.

Formally: Cross-Entropy Loss.
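
For reference, the cross-entropy of a single example can be computed from raw logits as shown below. This is the standard definition written in plain Python for clarity, not code from the paper; training frameworks provide an equivalent built-in loss.

```python
import math
from typing import List


def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Two equal logits give probability 0.5 for each class, so the loss is log(2).
loss_uniform = cross_entropy([0.0, 0.0], label=0)
# A confident, correct prediction gives a loss close to zero.
loss_confident = cross_entropy([10.0, 0.0], label=0)
print(loss_uniform, loss_confident)
```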

Training Data:

Camelyon17 (Pathology): 4,600 images, 5 sources (clients)
Retinal Fundus: 1,264 images, 4 sources (institutions/clients)
75% training, 25% testing split per client

Key Hyperparameters:

learning_rate: 1e-3
optimizer: Adam
batch_size: 16
momentum_coefficients: 0.9, 0.99
communication_rounds: 1000
local_epochs: 1
interpolation_start_epoch: 75% of total training epochs (default)
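
The listed values can be gathered into a small configuration object. The class and field names below are hypothetical, and mapping "75% of total training epochs" onto communication rounds is an assumption that relies on `local_epochs` being 1.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FedSoupConfig:
    learning_rate: float = 1e-3
    optimizer: str = "Adam"
    batch_size: int = 16
    momentum_coefficients: Tuple[float, float] = (0.9, 0.99)
    communication_rounds: int = 1000
    local_epochs: int = 1
    interpolation_start_fraction: float = 0.75  # assumed to apply to rounds

    def interpolation_start_round(self) -> int:
        """First round (0-indexed) in which selection and patching are active."""
        return int(self.interpolation_start_fraction * self.communication_rounds)

    def interpolation_active(self, round_idx: int) -> bool:
        return round_idx >= self.interpolation_start_round()


cfg = FedSoupConfig()
print(cfg.interpolation_start_round())  # 750
```

Under these assumptions the first 750 rounds run as plain federated training, and soup selection and patching apply only in the final quarter.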

Compute: Not reported in the paper

Comparison to Prior Work

vs. FedAvg: FedSoup uses historical model averaging and selection, not just the latest global model.
vs. FedProx: FedProx constrains updates via a loss term; FedSoup constrains/guides via direct weight interpolation.
vs. FedRep/FedBABU: These split the architecture (body vs. head); FedSoup operates on the full model weights via interpolation.
vs. SWA (Stochastic Weight Averaging), a methodological source rather than a baseline in the paper: SWA averages all weights along the training trajectory; FedSoup averages selectively, based on validation performance, in an FL context.

Limitations

Requires storage of multiple model checkpoints (ingredients) on the client side, though the paper claims to average them on the fly.
Effectiveness depends on the 'interpolation start epoch' hyperparameter.
Evaluation limited to two medical image classification tasks (ResNet-18 backbone only).

Reproducibility

Code: https://github.com/ubc-tea/FedSoup

Datasets are public (Camelyon17, Retinal Fundus sources). Hyperparameters are explicitly listed. Evaluation involves 5-fold cross-validation.

📊 Experiments & Results

Evaluation Setup

Medical Image Classification in a Cross-Silo FL setting.

Benchmarks:

Camelyon17 (Pathology): tumor classification, 5 clients
Retinal Fundus: glaucoma classification, 4 clients

Metrics:

Local Performance (Accuracy, AUC) on client's local test set
Global Performance (Accuracy, AUC) on held-out joint distribution
Unseen Domain Generalization (AUC)
Statistical methodology: 5-fold leave-one-client-data cross-validation, 3 repetitions. Standard deviation reported.
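
Unseen-domain evaluation implies training without one client and testing on it. The sketch below enumerates such splits for a 5-client setup with 3 repetitions. The client names are hypothetical, and reading the protocol as one held-out client per fold is an assumption; the paper's exact splitting procedure may differ.

```python
from typing import Iterator, List, Tuple


def leave_one_client_out(clients: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training clients, held-out client) pairs, one per client."""
    for held_out in clients:
        yield [c for c in clients if c != held_out], held_out


# Hypothetical client names for a 5-source dataset.
clients = ["hospital_0", "hospital_1", "hospital_2", "hospital_3", "hospital_4"]
repetitions = 3  # matches the 3 repetitions stated above

runs = [
    (seed, train, held_out)
    for seed in range(repetitions)
    for train, held_out in leave_one_client_out(clients)
]
print(len(runs))  # 15 runs: 5 folds x 3 repetitions
```

Reported means and standard deviations would then be computed over the metric values of these runs.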

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Retinal Fundus Results: FedSoup achieves the best balance, improving Global Accuracy (+0.67) and Global AUC (+0.51) over the baseline while also raising Local AUC (+1.15).
Retinal Fundus	Global Accuracy	90.25	90.92	+0.67
Retinal Fundus	Global AUC	95.49	96.00	+0.51
Retinal Fundus	Local AUC	85.09	86.24	+1.15
Pathology Results: FedSoup improves global generalization (+2.69 Global Accuracy, +2.18 Global AUC over the baseline), in contrast to personalization-focused methods like FedRep and FedBN.
Pathology	Global Accuracy	70.18	72.87	+2.69
Pathology	Global AUC	79.27	81.45	+2.18
Unseen Domain Generalization: FedSoup shows superior ability to generalize to completely new clients/domains not seen during training.
Pathology	AUC	76.76	79.63	+2.87
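
The Δ column is the difference between the "This Paper" and "Baseline" columns. The short check below recomputes it from the rows above; the tuple layout and variable names are only for illustration.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table.
rows = [
    ("Retinal Fundus", "Global Accuracy", 90.25, 90.92, 0.67),
    ("Retinal Fundus", "Global AUC", 95.49, 96.00, 0.51),
    ("Retinal Fundus", "Local AUC", 85.09, 86.24, 1.15),
    ("Pathology", "Global Accuracy", 70.18, 72.87, 2.69),
    ("Pathology", "Global AUC", 79.27, 81.45, 2.18),
    ("Pathology", "Unseen-domain AUC", 76.76, 79.63, 2.87),
]

# A row is a mismatch if the recomputed difference disagrees with the reported one.
mismatches = [
    (benchmark, metric)
    for benchmark, metric, baseline, ours, reported in rows
    if abs((ours - baseline) - reported) > 1e-6
]
print(mismatches)  # []
```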

Experiment Figures

Sharpness quantification using Hessian Eigenvalues on the retina dataset.

Trade-off analysis: Local vs Global accuracy at different personalization levels (fine-tuning epochs).

Main Takeaways

FedSoup effectively navigates the personalization-generalization trade-off, whereas methods like FedRep improve local accuracy but harm global generalization.
Sharpness quantification (Hessian eigenvalues) indicates that FedSoup finds flatter minima than FedAvg and FedProx, which is consistent with the improved generalization.
The method is particularly effective on the smaller Retinal Fundus dataset, suggesting strong capabilities in mitigating overfitting on limited data.

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (FedAvg, Cross-silo settings)
Stochastic Gradient Descent (SGD)
Loss Landscape Geometry (Sharp vs. Flat Minima)
Model Interpolation / Weight Averaging

Key Terms

Federated Model Soup: A weight-averaging method in FL where the weights of multiple historical global models are averaged into a single, more robust model (unlike an output ensemble, only one model is kept).

Flat Minima: A region in the loss landscape where the loss function is relatively constant around the minimum; models in flat minima tend to generalize better than those in sharp minima.

PFL: Personalized Federated Learning—techniques to adapt a global FL model to specific local client distributions.

Model Patching: A fine-tuning technique that interpolates weights between a reference model (global soup) and a target model (local) to retain general capabilities while adapting.

SWA: Stochastic Weight Averaging—an optimization technique that averages model weights along the trajectory of SGD to find flatter minima.

Cross-silo FL: FL setting where clients are typically organizations (e.g., hospitals) with moderate amounts of data and reliable connections, as opposed to millions of mobile devices (cross-device).

OOD: Out-of-Distribution—data samples that differ in distribution from the training data (e.g., images from a different hospital).

Hessian Eigenvalue: A measure of the curvature of the loss function along a direction in weight space; a larger top eigenvalue indicates sharper curvature, which is typically associated with worse generalization.
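
To make the sharpness measure concrete, the sketch below estimates the largest Hessian eigenvalue of two toy quadratic losses by power iteration. The explicit 2x2 matrices and the function name are assumptions for illustration; for a network such as ResNet-18 the Hessian is never formed explicitly and Hessian-vector products from automatic differentiation would be used instead. The paper's exact measurement procedure is not described in this summary.

```python
import math
from typing import List

Matrix = List[List[float]]


def top_eigenvalue(hessian: Matrix, iters: int = 200) -> float:
    """Estimate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    n = len(hessian)
    # Unit-norm start vector; it must not be orthogonal to the top eigenvector.
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        hv = [sum(hessian[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in hv]
    hv = [sum(hessian[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v


# Hessians of two toy quadratic losses: a sharp minimum and a flat one.
sharp = [[10.0, 0.0], [0.0, 1.0]]
flat = [[0.5, 0.0], [0.0, 0.1]]
sharp_eig = top_eigenvalue(sharp)
flat_eig = top_eigenvalue(flat)
print(round(sharp_eig, 6), round(flat_eig, 6))  # 10.0 0.5
```

The flatter loss has the smaller top eigenvalue, which is the direction of the comparison made between FedSoup and the baselines.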