Personalization Disentanglement for Federated Learning: An explainable perspective

📝 Paper Summary

Personalized Federated Learning (PFL)
Disentangled Representation Learning

FedDVA uses a dual-encoder Variational Autoencoder framework to explicitly disentangle shared global knowledge from client-specific personalized patterns in Federated Learning, improving both explainability and downstream task performance.

Core Problem

In Personalized Federated Learning (PFL), raw sample representations entangle universal knowledge with client-specific biases, making it difficult to effectively share global knowledge while retaining local personalization.

Why it matters:

Entangled representations hinder the efficient extraction and sharing of universal knowledge across the federation
Lack of disentanglement makes it hard to interpret what constitutes a client's specific characteristics or bias
Existing PFL methods focus on architecture or optimization (like fine-tuning) but neglect the fundamental representation perspective

Concrete Example: Consider handwritten digits on different clients where Client 1's images always have sinusoidal marks and Client 2's have elliptical marks. Standard FL mixes the digit features (universal) with the marks (personalized). FedDVA separates these so the model learns 'digit 7' globally while isolating 'sinusoidal mark' as a local style.

Key Novelty

Federated Dual Variational Autoencoder (FedDVA)

Deploys two separate encoders: one for universal latent representations (shared) and one for personalized latent representations (client-specific)
Uses a novel client-specific Evidence Lower Bound (ELBO) with a constraint that forces the personalized representation to be closer to the local client distribution than the global mixture distribution

Architecture

The FedDVA architecture showing the interaction between the Blue Encoder (Universal), Red Encoder (Personalized), and White Decoder.

Evaluation Highlights

Achieves higher classification accuracy than FedAvg, FedAvg+FineTuning, and Ditto on MNIST and CIFAR-10 under heterogeneous settings
Visualizations confirm clear manifold separation: changing the universal latent variable alters the object (e.g., face identity), while changing the personalized variable alters style (e.g., hairstyle/background)
Demonstrates faster convergence in communication rounds compared to vanilla Federated Learning baselines

Breakthrough Assessment

7/10

A solid methodological contribution applying disentanglement (VAEs) to PFL. While VAEs are established, the dual-encoder formulation for FL with specific regularization for personalization is a clever, explainable approach.

⚙️ Technical Details

Problem Definition

Setting: Federated Learning with $K$ clients, optimizing global parameters $\theta^*$ while handling local data distributions $D_k$. The goal is to learn disentangled representations $z$ (universal) and $c$ (personalized).

Inputs: Local private data samples $x$ distributed across clients

Outputs: Two latent representations: $z$ (universal knowledge) and $c$ (client-specific personalization), used for reconstruction or downstream classification

Pipeline Flow

Universal Encoder f(x) → infers q(z|x)
Personalized Encoder h(x, z) → infers q(c|x, z)
Client-specific Decoder g(z, c) → reconstructs x
Server aggregates Universal Encoder; Clients keep Personalized Encoder/Decoder local
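
The flow above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: single random affine maps stand in for the CNN encoders and decoder, and the helper names (`linear`, `gaussian_head`, `sample`, `forward`) and the flattened input size of 784 are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

X_DIM, Z_DIM, C_DIM = 784, 4, 4  # assumed sizes (flattened 28x28 input)

def linear(in_dim, out_dim):
    """Random affine map (weights, bias); a stand-in for a CNN module."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

def gaussian_head(inputs, params, latent_dim):
    """Map inputs to the mean and log-variance of a diagonal Gaussian."""
    weights, bias = params
    out = inputs @ weights + bias
    return out[:, :latent_dim], out[:, latent_dim:]

def sample(mu, logvar):
    """Reparameterization trick: mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

universal_enc = linear(X_DIM, 2 * Z_DIM)          # f(x): shared via the server
personal_enc = linear(X_DIM + Z_DIM, 2 * C_DIM)   # h(x, z): stays on the client
decoder = linear(Z_DIM + C_DIM, X_DIM)            # g(z, c): stays on the client

def forward(x):
    # Step 1: the universal encoder infers q(z|x).
    mu_z, logvar_z = gaussian_head(x, universal_enc, Z_DIM)
    z = sample(mu_z, logvar_z)
    # Step 2: the personalized encoder infers q(c|x, z) from x and the sampled z.
    mu_c, logvar_c = gaussian_head(np.concatenate([x, z], axis=1), personal_enc, C_DIM)
    c = sample(mu_c, logvar_c)
    # Step 3: the decoder reconstructs x from both latents (sigmoid output).
    weights, bias = decoder
    x_hat = 1.0 / (1.0 + np.exp(-(np.concatenate([z, c], axis=1) @ weights + bias)))
    return z, c, x_hat

x = rng.random((8, X_DIM))  # a batch of 8 fake inputs
z, c, x_hat = forward(x)
```

The point of the sketch is the data dependency: `c` cannot be computed before `z`, which mirrors the sequential inference described in the pipeline.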

System Modules

Universal Encoder (Encoding)

Infers the posterior for universal knowledge z, shared across the federation

Model or implementation: 4-layer CNN backbone + 2 FC embedding layers

Personalized Encoder (Encoding)

Infers the posterior for client-specific knowledge c, conditioned on x and z

Model or implementation: FC combination layer → 4-layer CNN backbone + 2 FC embedding layers

Local Decoder

Reconstructs the original input from the disentangled representations

Model or implementation: Reverse of encoding modules (transposed CNNs)

Novel Architectural Elements

Dual-encoder structure where one encoder is globally aggregated (Universal) and one is kept local (Personalized)
Sequential inference dependency: Personalized encoder h(x, z) takes the output of the Universal encoder as input
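
The split between aggregated and local parameters can be illustrated with a partial FedAvg step. This is a sketch only: the parameter-name prefix `universal_encoder.`, the helper names, and the weighting by client sample count (as in standard FedAvg) are assumptions for illustration, not details confirmed by the summary.

```python
import numpy as np

def aggregate_universal(client_states, client_sizes, shared_prefix="universal_encoder."):
    """Sample-size-weighted average over the shared parameters only."""
    total = float(sum(client_sizes))
    shared_keys = [k for k in client_states[0] if k.startswith(shared_prefix)]
    return {
        k: sum((n / total) * state[k] for state, n in zip(client_states, client_sizes))
        for k in shared_keys
    }

def load_shared(state, global_shared):
    """Overwrite shared entries and leave personalized entries untouched."""
    updated = dict(state)
    updated.update(global_shared)
    return updated

# Two toy clients with one shared and one personalized parameter each.
clients = [
    {"universal_encoder.w": np.array([1.0, 2.0]), "personal_encoder.w": np.array([10.0])},
    {"universal_encoder.w": np.array([3.0, 6.0]), "personal_encoder.w": np.array([20.0])},
]
sizes = [100, 300]

global_shared = aggregate_universal(clients, sizes)
clients = [load_shared(state, global_shared) for state in clients]
```

After the round, both clients hold the same universal encoder weights, while each personalized encoder keeps its own local values.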

Modeling

Base Model: Custom CNN-based VAE architecture (4-layer CNNs)

Training Method: Federated Learning with alternating updates (Local Decoder update → Global/Local Encoder update)

Objective Functions:

Purpose: Optimize VAE reconstruction and disentanglement.

Formally: minimize the negative ELBO $-\mathbb{E}[\log p(x|z,c)] + \alpha\,\mathrm{KL}(q(z|x)\,\|\,p(z)) + \beta\,R_c(q(c|x,z))$

Purpose: Enforce personalization constraint on c.

Formally: $R_c$ uses a slack regularizer $\max(\xi + \mathrm{KL}(q\,\|\,\bar{p}),\ \mathrm{KL}(q\,\|\,q_{\text{global}}))$ to ensure the local posterior is closer to the local prior than to the global prior
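
As a rough numerical illustration of the objective as stated above, the following sketch evaluates each term for toy values. It assumes all distributions are diagonal Gaussians (so the KL terms have a closed form) and treats the reconstruction log-likelihood as a given number; the function names and the toy distributions are hypothetical, and the exact definitions of the local and global reference distributions are not taken from the paper.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p)), summed over dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0, axis=-1
    )

def slack_regularizer(kl_local, kl_global, xi):
    """R_c as summarized: max(xi + KL(q || p_bar), KL(q || q_global))."""
    return np.maximum(xi + kl_local, kl_global)

def negative_elbo(recon_log_lik, kl_z, r_c, alpha=1.0, beta=0.75):
    """-E[log p(x|z,c)] + alpha * KL(q(z|x) || p(z)) + beta * R_c."""
    return -recon_log_lik + alpha * kl_z + beta * r_c

zeros = np.zeros(2)

# Universal latent: q(z|x) = N([1, 0], I) against a standard normal prior.
kl_z = kl_diag_gaussians(np.array([1.0, 0.0]), zeros, zeros, zeros)

# Personalized latent: q(c|x,z) equals the assumed local reference p_bar,
# and sits one unit away from the assumed global reference q_global.
mu_c = np.array([1.0, 0.0])
kl_local = kl_diag_gaussians(mu_c, zeros, mu_c, zeros)
kl_global = kl_diag_gaussians(mu_c, zeros, zeros, zeros)

xi = 8 * mu_c.shape[0]  # xi = 8 * dimension of c, per the hyperparameter list
r_c = slack_regularizer(kl_local, kl_global, xi)
loss = negative_elbo(recon_log_lik=-100.0, kl_z=kl_z, r_c=r_c)
```

With these toy values the first argument of the max dominates, so the regularizer reduces to the slack plus the local KL term.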

Key Hyperparameters:

learning_rate: 0.001
batch_size: 256
communication_rounds: 200
local_epochs: 5
alpha: 1
beta: 0.75
xi: 8 * dimension of c
latent_dim_z: 4 (reconstruction) / 8 (classification)
latent_dim_c: 4 (reconstruction) / 8 (classification)
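
One way to keep these settings together is a small configuration object in which xi is derived from the dimension of c rather than set by hand. The class name `FedDVAConfig` and its layout are hypothetical; only the values come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FedDVAConfig:
    """Hypothetical container for the hyperparameters listed above."""
    learning_rate: float = 0.001
    batch_size: int = 256
    communication_rounds: int = 200
    local_epochs: int = 5
    alpha: float = 1.0
    beta: float = 0.75
    latent_dim_z: int = 4  # 4 for reconstruction, 8 for classification
    latent_dim_c: int = 4  # 4 for reconstruction, 8 for classification

    @property
    def xi(self) -> float:
        # xi = 8 * dimension of c, so it follows latent_dim_c automatically.
        return 8.0 * self.latent_dim_c

reconstruction_cfg = FedDVAConfig()
classification_cfg = FedDVAConfig(latent_dim_z=8, latent_dim_c=8)
```

Deriving xi as a property avoids a stale threshold when the latent dimension changes between the reconstruction and classification settings.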

Compute: Not reported in the paper

Comparison to Prior Work

vs. FedAvg/Ditto: FedDVA operates at the representation level via disentanglement rather than parameter regularization or fine-tuning
vs. Standard VAE in FL [not cited in paper]: FedDVA introduces a dual encoder with specific 'slack' regularization to separate client bias from global features explicitly

Limitations

Privacy risk: While representations are disentangled, the decoder (if leaked) could potentially reconstruct private data; however, the decoder is kept local.
Requires hyperparameter tuning for the constraint threshold (xi) which depends on latent dimensions.
Computation overhead of maintaining dual encoders compared to a single model.

Reproducibility

Code: https://github.com/pysleepy/FedDVA

Hyperparameters (alpha, beta, xi, batch size, learning rate) are specified in the Appendix. Dataset generation details for synthetic heterogeneity (marks on digits) are provided.

📊 Experiments & Results

Evaluation Setup

Federated Learning simulation with heterogeneous data distributions

Benchmarks:

MNIST (synthesized) (Image Reconstruction & Classification (Heterogeneous Inputs)) [New]
CelebA (Image Reconstruction (Attribute Bias))
CIFAR-10 (Image Classification (Heterogeneous Outputs))

Metrics:

Classification Accuracy
Reconstruction Quality (Visual)
Latent Space Visualization (t-SNE)
Statistical methodology: Not explicitly reported in the paper

Key Results

Classification accuracy results demonstrating FedDVA's performance against baselines on heterogeneous setups.

Benchmark	Metric	Baseline	This Paper	Δ
MNIST (Heterogeneous Inputs)	Accuracy	88.0	97.0	+9.0
MNIST (Heterogeneous Inputs)	Accuracy	95.0	97.0	+2.0
CIFAR-10	Accuracy	60.0	70.0	+10.0

Experiment Figures

Reconstruction of MNIST digits where columns vary 'c' (personalization) and rows vary 'z' (content).

t-SNE visualization of latent spaces z and c.

Main Takeaways

FedDVA effectively disentangles representations: Visualizations show that changing the universal variable 'z' changes content (digit/identity), while changing 'c' changes style (marks/attributes).
Converges in fewer communication rounds than vanilla FedAvg and Ditto.
Maintains lower variance in accuracy across different clients compared to baselines (smaller shadow regions in plots).

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (FedAvg algorithm)
Variational Autoencoders (VAE) and ELBO optimization
Kullback-Leibler (KL) divergence
Disentangled representation learning

Key Terms

ELBO: Evidence Lower Bound—the objective function used to train Variational Autoencoders, consisting of a reconstruction term and a regularization term
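
For reference, the standard single-latent VAE form of the bound is shown below. This is the textbook ELBO, not FedDVA's client-specific variant, which adds the second latent $c$ and the regularizer on it.

$$
\log p(x) \;\ge\; \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] \;-\; \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)
$$

The first term rewards accurate reconstruction; the second keeps the approximate posterior close to the prior.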

FedAvg: Federated Averaging—the standard algorithm for Federated Learning where client model updates are averaged by a central server

Disentanglement: Separating the factors of variation in data so that each dimension of the representation corresponds to a single generative factor (e.g., digit shape vs. writing style)

VAE: Variational Autoencoder—a generative model that learns to compress data into a latent space and reconstruct it

KL divergence: A measure of how much one probability distribution differs from another; it is asymmetric, so it is not a true distance metric. Used here to regularize latent spaces
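
A small numerical check makes the asymmetry of KL divergence concrete. The two discrete distributions and the helper name `kl_discrete` are arbitrary choices for illustration.

```python
import math

def kl_discrete(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

forward_kl = kl_discrete(p, q)  # KL(p || q)
reverse_kl = kl_discrete(q, p)  # KL(q || p)
self_kl = kl_discrete(p, p)     # zero when both arguments match
```

Swapping the arguments changes the value, which is why the order of the two distributions inside each KL term of an objective matters.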

PFL: Personalized Federated Learning—a variation of FL where the goal is to train models customized for each client rather than a single global model