FedCP: Separating Feature Information for Personalized Federated Learning via Conditional Policy

📝 Paper Summary

Personalized Federated Learning (pFL)
Conditional Computing in FL

FedCP uses a lightweight Conditional Policy Network to dynamically separate global and personalized information within feature vectors for each sample, processing them through distinct global and personalized heads.

Core Problem

Existing pFL methods treat model parameters as the sole unit of personalization, neglecting that the underlying data features themselves contain a mix of global and personalized information.

Why it matters:

Treating all features uniformly fails to capture fine-grained distinctions in non-IID data, leading to suboptimal performance on local clients.
Simply fine-tuning a global model or regularizing local training does not explicitly disentangle shared knowledge from client-specific nuances at the data level.

Concrete Example: In an image dataset, a 'dog' (global concept) might appear on a 'pink rug' (rare/personalized context). Traditional pFL methods process the entire image feature through one model, whereas FedCP separates the 'dog' features for a global head and 'pink rug' features for a personalized head.

Key Novelty

Sample-Specific Feature Separation via Conditional Policy

Introduces a Conditional Policy Network (CPN) that acts like a dynamic router, generating a soft mask for every input sample to split its features into global and personalized components.
Uses a dual-head architecture where 'global' features are processed by a frozen global head (to preserve shared knowledge) and 'personalized' features by a trainable local head.
Generates the routing policy based on both the sample itself and a client-specific embedding derived from the local personalized head.

Architecture

The complete forward pass and data flow of FedCP on a client.

Evaluation Highlights

Outperforms the state-of-the-art method Ditto by 6.69 percentage points in test accuracy on Cifar100 in the practical non-IID setting.
Remains stable when clients drop out unexpectedly, maintaining ~54% accuracy on Cifar100 while baselines such as pFedMe degrade significantly.
Incurs only ~4.67% additional parameters per client compared to ResNet-18, making it communication-efficient.

Breakthrough Assessment

8/10

Significantly shifts the pFL focus from model-level to feature-level personalization. The explicit separation mechanism yields substantial gains on hard tasks (Cifar100) and robustness to client dropout.

⚙️ Technical Details

Problem Definition

Setting: Personalized Federated Learning with N clients having non-IID private datasets D_i.

Inputs: Input sample x_i

Outputs: Prediction y_i, computed by summing outputs from a global head and a personalized head.

Pipeline Flow

Feature Extraction (Personalized Backbone)
Policy Generation (CPN)
Feature Separation (Masking)
Dual Head Processing (Global & Personalized Heads)
Aggregation (Summation)
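Steps 3 to 5 of the flow above reduce to a few lines once a feature vector and a policy mask are available. The sketch below is a minimal illustration in plain Python, not the authors' implementation: the helper names (`linear`, `separate`, `fedcp_logits`), the toy weights, and the choice to derive the personalized mask as `1 - r` are all assumptions made for the example.

```python
def linear(weights, bias, x):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def separate(h, r):
    """Step 3: split feature h with a soft mask r in [0, 1] (assumes s = 1 - r)."""
    global_part = [r_k * h_k for r_k, h_k in zip(r, h)]
    personal_part = [(1.0 - r_k) * h_k for r_k, h_k in zip(r, h)]
    return global_part, personal_part


def fedcp_logits(h, r, global_head, personal_head):
    """Steps 4-5: run each part through its own head, then sum the logits."""
    g_part, p_part = separate(h, r)
    g_out = linear(global_head[0], global_head[1], g_part)
    p_out = linear(personal_head[0], personal_head[1], p_part)
    return [g + p for g, p in zip(g_out, p_out)]


# Toy example: 3-dim feature, 2 classes (all numbers are made up).
h = [1.0, 2.0, 3.0]                      # output of the feature extractor (step 1)
r = [0.9, 0.5, 0.1]                      # global mask from the policy (step 2)
global_head = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0])
personal_head = ([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]], [0.0, 0.0])
logits = fedcp_logits(h, r, global_head, personal_head)
print(logits)
```

Because the two masked parts add back up to `h`, no feature information is discarded; the mask only decides which head sees how much of each dimension.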

System Modules

Personalized Feature Extractor

Maps input data to a feature vector; it is updated locally but kept aligned with the global feature extractor via an MMD loss

Model or implementation: ResNet-18 or 4-layer CNN (initialized globally, trained locally)

Conditional Policy Network (CPN)

Generates soft masks r_i (global) and s_i (personalized) to split feature information

Model or implementation: FC layer + LayerNorm + ReLU + Softmax
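The following plain-Python sketch shows one way the FC + LayerNorm + ReLU + Softmax stack could turn a sample feature and a client embedding into the two masks. Several details are assumptions made for illustration and are not stated in this summary: the client embedding is taken as the L2-normalized column sums of the personalized head weights, it is combined with the sample feature by element-wise product, and the FC layer emits two values per feature dimension so that a pairwise softmax yields `r[k] + s[k] = 1`.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def client_vector(personal_head_weights):
    """Assumed client embedding: column sums of the head weights, L2-normalized."""
    dim = len(personal_head_weights[0])
    v = [sum(row[k] for row in personal_head_weights) for k in range(dim)]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cpn_policy(h, v, fc_weights, fc_bias):
    """Return soft masks (r, s) with r[k] + s[k] == 1 for every feature dim k."""
    # Assumption: condition on sample and client via element-wise product.
    context = [v_k * h_k for v_k, h_k in zip(v, h)]
    z = [sum(w * c for w, c in zip(row, context)) + b
         for row, b in zip(fc_weights, fc_bias)]          # FC: K -> 2K
    z = [max(0.0, a) for a in layer_norm(z)]              # LayerNorm, then ReLU
    r, s = [], []
    for k in range(len(h)):                               # softmax over each pair
        a, b = z[2 * k], z[2 * k + 1]
        m = max(a, b)
        ea, eb = math.exp(a - m), math.exp(b - m)
        r.append(ea / (ea + eb))
        s.append(eb / (ea + eb))
    return r, s


# Toy example with a 2-dim feature and a 2-class personalized head.
h = [1.0, -2.0]
v = client_vector([[0.5, 1.0], [0.5, -1.0]])
fc_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
fc_bias = [0.0, 0.0, 0.0, 0.0]
r, s = cpn_policy(h, v, fc_weights, fc_bias)
print(r, s)
```

The pairwise softmax is what makes the masks soft and complementary: each feature dimension is shared between the two heads rather than assigned to one of them outright.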

Global Head (Classification)

Processes global feature information; kept frozen during local training to preserve shared knowledge

Model or implementation: Fully Connected Layer

Personalized Head (Classification)

Processes personalized feature information; trained locally to fit specific client data

Model or implementation: Fully Connected Layer

Novel Architectural Elements

Parallel processing branches: one frozen global branch and one trainable personalized branch for features identified by CPN.
CPN-based dynamic routing dependent on both sample features and client-specific embeddings.

Modeling

Base Model: ResNet-18 (for Tiny-ImageNet) and 4-layer CNN (for MNIST/Cifar)

Training Method: Federated Learning with local personalization updates (SGD)

Objective Functions:

Purpose: Optimize classification accuracy.

Formally: Cross-entropy loss on the sum of the global and personalized head outputs.

Purpose: Align personalized features with global features.

Formally: MMD loss between the outputs of the personalized feature extractor and the frozen global feature extractor.
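To make the two objectives concrete, here is a minimal plain-Python sketch of how they might be combined into a single local loss. It rests on assumptions not confirmed by this summary: the MMD term uses a linear kernel (squared distance between batch mean features), and the two terms are added with the weight `mmd_lambda`. The function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def linear_mmd(batch_a, batch_b):
    """Linear-kernel MMD: squared distance between the two batch mean vectors."""
    dim = len(batch_a[0])
    mean_a = [sum(f[k] for f in batch_a) / len(batch_a) for k in range(dim)]
    mean_b = [sum(f[k] for f in batch_b) / len(batch_b) for k in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))


def local_loss(batch_logits, labels, personal_feats, global_feats, mmd_lambda):
    """Mean cross-entropy on summed head outputs plus a weighted alignment term."""
    ce = sum(cross_entropy(z, y) for z, y in zip(batch_logits, labels)) / len(labels)
    return ce + mmd_lambda * linear_mmd(personal_feats, global_feats)


# Toy batch of two samples (all numbers are made up).
batch_logits = [[2.0, 0.0], [0.0, 1.0]]      # global head output + personalized head output
labels = [0, 1]
personal_feats = [[1.0, 0.0], [0.0, 1.0]]    # from the locally trained extractor
global_feats = [[1.0, 1.0], [1.0, 1.0]]      # from the frozen global extractor
loss = local_loss(batch_logits, labels, personal_feats, global_feats, mmd_lambda=1.0)
print(loss)
```

With identical feature batches the alignment term vanishes, so the loss falls back to plain cross-entropy; a larger `mmd_lambda` pulls the personalized extractor harder toward the global one.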

Key Hyperparameters:

learning_rate: 0.005 (CNN), 0.1 (ResNet-18)
batch_size: 10
local_epochs: 1
client_joining_ratio: 1.0 (default)
mmd_lambda: Determined by ablation (values 1-50 tested)

Compute: Adds 0.527M parameters (CPN) per client for ResNet-18 (~4.67% overhead).
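As a quick sanity check on these two figures, the short calculation below derives the base model size they imply. It uses only the numbers quoted above; the variable names are illustrative.

```python
cpn_params_m = 0.527       # extra CPN parameters per client, in millions
overhead_ratio = 0.0467    # reported relative overhead (~4.67%)

# If overhead = cpn / base, then base = cpn / overhead.
implied_base_params_m = cpn_params_m / overhead_ratio
print(f"Implied base model size: {implied_base_params_m:.2f}M parameters")
```

The result, roughly 11.3M parameters, is in line with the usual size of a ResNet-18, so the two reported numbers are mutually consistent.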

Comparison to Prior Work

vs. FedRep: FedCP separates features per-sample rather than just splitting the model layers.
vs. FedRoD: FedRoD's dual heads compete; FedCP's heads cooperate via explicit routing policies.
vs. Ditto: FedCP operates at feature granularity using conditional computing, Ditto operates at parameter granularity.
vs. ConvNet-AIG: ConvNet-AIG adapts features for centralized learning; FedCP adapts feature routing for federated personalization [not cited in paper]

Limitations

Requires tuning the MMD loss weight lambda, whose best value depends on the level of data heterogeneity.
Per-sample policy generation adds slight computational overhead during inference compared to static networks.
Evaluated primarily on classification tasks; applicability to other domains (e.g., generation) not tested.

Reproducibility

Code: https://github.com/TsingZ0/FedCP

Code is publicly available. Datasets (MNIST, Cifar10/100, Tiny-ImageNet, AG News) are standard public benchmarks. Hyperparameters for baselines and the method are detailed.

📊 Experiments & Results

Evaluation Setup

Image and text classification under pathological (label partition) and practical (Dirichlet distribution) non-IID settings.
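The practical setting can be illustrated with a common Dirichlet label-skew recipe: for each class, client shares are drawn from Dir(beta) and the class's samples are split accordingly. The sketch below is a generic stdlib-only version under that assumption and is not taken from the paper's code; the function name and the rounding scheme are illustrative.

```python
import random


def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Assign sample indices to clients with per-class Dirichlet(beta) proportions."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet draw is a set of independent Gamma(beta, 1) draws, normalized.
        gammas = [rng.gammavariate(beta, 1.0) for _ in range(num_clients)]
        total = sum(gammas) or 1.0
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += gammas[c] / total
            # The last client takes whatever remains so every index is assigned.
            end = len(idxs) if c == num_clients - 1 else min(len(idxs), round(cumulative * len(idxs)))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


# Toy usage: 200 samples, 10 classes, 5 clients, strong skew (beta = 0.1).
labels = [i % 10 for i in range(200)]
parts = dirichlet_partition(labels, num_clients=5, beta=0.1)
print([len(p) for p in parts])
```

Smaller `beta` concentrates each class on fewer clients (more heterogeneous), while larger `beta` approaches an even split.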

Benchmarks:

MNIST (Image Classification)
Cifar10 (Image Classification)
Cifar100 (Image Classification)
Tiny-ImageNet (Image Classification)
AG News (Text Classification)

Metrics:

Test Accuracy
Statistical methodology: Report mean and standard deviation over 5 runs.

Key Results

Main comparison on practical non-IID settings (Dirichlet beta=0.1) shows FedCP consistently outperforming baselines, especially on harder tasks.

Benchmark	Metric	Baseline	This Paper	Δ
Cifar100	Test Accuracy (%)	52.87	59.56	+6.69
Tiny-ImageNet (ResNet-18)	Test Accuracy (%)	39.95	44.18	+4.23
AG News	Test Accuracy (%)	96.28	96.78	+0.50

Robustness to client dropout (simulating unstable mobile networks) shows FedCP maintains performance while others degrade.

Benchmark	Metric	Baseline	This Paper	Δ
Cifar100	Test Accuracy (%)	44.43	54.20	+9.77

Scalability experiments varying client numbers show FedCP scaling better than baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Cifar100	Test Accuracy (%)	30.24	35.87	+5.63

Experiment Figures

Grad-CAM visualizations of what the global head vs. personalized head focuses on.

Evolution of the Personalization Identification Ratio (PIR) during training.

Main Takeaways

FedCP consistently outperforms SOTA pFL methods (FedRep, Ditto, FedRoD) across varying degrees of data heterogeneity (beta=0.01 to 1.0).
The method is highly robust to client dropouts, maintaining high accuracy even when participation fluctuates randomly, unlike regularization-based methods.
Ablation studies confirm that both the CPN (Conditional Policy Network) and the feature alignment (MMD loss) are critical; removing CPN causes a ~3% accuracy drop.
Visualizations (Grad-CAM) confirm the dual heads specialize: the global head focuses on background/generic features (sky, grass), while the personalized head focuses on specific objects/colors.

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (FedAvg algorithm)
Personalized Federated Learning (pFL)
Conditional Computing / Dynamic Routing
Maximum Mean Discrepancy (MMD) Loss

Key Terms

CPN: Conditional Policy Network—a small auxiliary network that generates separation policies (masks) for feature vectors.

pFL: Personalized Federated Learning—FL variants that train distinct models for each client rather than a single global model.

MMD: Maximum Mean Discrepancy—a statistical measure used here to align the distribution of features from the personalized extractor with those of the global extractor.

non-IID: Non-Independent and Identically Distributed data—data distributions vary across clients (e.g., different label skews).

Grad-CAM: Gradient-weighted Class Activation Mapping—a visualization technique to highlight which parts of an image a CNN focuses on.

feature extractor: The initial layers of a neural network (backbone) that map raw inputs to low-dimensional feature vectors.

head: The final layers (usually fully connected) of a neural network that map features to class predictions.