Split Federated Learning Architectures for High-Accuracy and Low-Delay Model Training

📝 Paper Summary

Split Federated Learning (SFL), Hierarchical Federated Learning

The paper proposes an accuracy-aware hierarchical Split Federated Learning algorithm that jointly optimizes model partitioning layers and client-to-aggregator assignments to minimize training delay while maintaining high model accuracy.

Core Problem

Existing Hierarchical Split Federated Learning (HSFL) schemes overlook how the selection of partitioning layers and client-to-aggregator assignments impacts model accuracy, often leading to suboptimal performance.

Why it matters:

Selecting a suboptimal cut layer can severely degrade accuracy (as shown with AlexNet/VGG-11), contradicting common assumptions that accuracy is invariant to partitioning.
Split Federated Learning suffers from high training delays due to backward locking (clients waiting for server) and straggler effects (fast clients waiting for slow ones).
Current approaches optimize delay or overhead but fail to jointly address the trade-off between minimizing delay and preserving accuracy in a hierarchical setting.

Concrete Example: In a standard SFL setup with AlexNet, selecting convolution layer 2 as the cut layer results in significantly lower accuracy compared to layer 5. Existing delay-minimizing algorithms might select layer 2 to save bandwidth, inadvertently crippling model performance.

Key Novelty

Accuracy-Aware Hierarchical Split Federated Learning with Local Loss (AA-HSFL-ll)

Formulates a joint optimization problem that simultaneously selects two partitioning layers (aggregator layer and cut layer) and assigns clients to local aggregators.
Introduces an accuracy-aware heuristic that first identifies candidate cut layers satisfying an accuracy threshold, then optimizes assignments within that subset to minimize round delay.
Combines local-loss learning (to mitigate backward locking) with hierarchical aggregation (to mitigate stragglers) in a unified framework.
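The two-stage idea in the second point (filter cut layers by accuracy, then minimize delay within the surviving set) can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names, the per-layer accuracy estimates `est_accuracy`, and the `round_delay` callback are hypothetical, and the callback is assumed to already account for the client-to-aggregator assignment.

```python
def candidate_cut_layers(est_accuracy, threshold):
    """Stage 1: keep cut layers whose estimated accuracy meets the threshold (the set V*)."""
    return sorted(v for v, acc in est_accuracy.items() if acc >= threshold)


def select_split(est_accuracy, threshold, round_delay):
    """Stage 2: among accuracy-feasible cut layers, pick the (h, v) pair with the lowest delay.

    round_delay(h, v) is a caller-supplied, hypothetical delay estimate.
    Returns (delay, h, v), or None if no cut layer meets the threshold.
    """
    best = None
    for v in candidate_cut_layers(est_accuracy, threshold):
        for h in range(1, v):  # the aggregator layer must precede the cut layer
            d = round_delay(h, v)
            if best is None or d < best[0]:
                best = (d, h, v)
    return best


# Toy inputs (made up for illustration): accuracy per candidate cut layer.
toy_accuracy = {2: 0.61, 3: 0.70, 4: 0.74, 5: 0.76}
toy_delay = lambda h, v: 1.0 * h + 0.5 * (v - h) + 2.0 / v
choice = select_split(toy_accuracy, threshold=0.72, round_delay=toy_delay)
```

A delay-only search over the same toy inputs would be free to pick layer 2 or 3; the accuracy filter removes them before any delay comparison happens.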

Architecture

The 3-tier Hierarchical SFL architecture illustrating the partitioning of the model into Weak-side, Aggregator-side, and Server-side sub-models.

Evaluation Highlights

Improves accuracy by 3% compared to state-of-the-art SFL and HSFL schemes.
Reduces training delay by 20% and communication overhead by 50% relative to baselines.
Achieves near-optimal performance compared to exhaustive search while maintaining low computational complexity.

Breakthrough Assessment

7/10

Solid contribution to SFL that rigorously formulates the joint optimization of topology (client-to-aggregator assignment) and model partitioning. The reported gains, a 3% accuracy improvement together with a 20% reduction in training delay, are significant for distributed edge learning.

⚙️ Technical Details

Problem Definition

Setting: Hierarchical Split Federated Learning network with clients, local aggregators, and a central server over wireless links.

Inputs: Network graph with transmission rates, node computation throughputs, and model architecture.

Outputs: Optimal selection of aggregator layer h, cut layer v, and binary client-to-aggregator assignment variables x_{n,k,l}.

Pipeline Flow

Client (Weak-side FP) → Local Aggregator (Aggregator-side FP) → Server (Server-side FP & BP)
Local Aggregator (Local-loss calculation & Aggregator-side BP) → Client (Weak-side BP)

System Modules

Client

Trains the 'weak-side model' (layers 1 to h); performs forward pass and backward pass using gradients from aggregator.

Model or implementation: Partial DNN (layers 1...h)

Local Aggregator

Trains 'aggregator-side model' (layers h+1 to v); aggregates models from assigned clients; computes local loss at cut layer v to break dependency on server.

Model or implementation: Partial DNN (layers h+1...v)

Server

Trains 'server-side model' (layers v+1 to L); performs global aggregation of all sub-models.

Model or implementation: Partial DNN (layers v+1...L)
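The three module descriptions above amount to slicing an ordered layer list at h and v. A minimal sketch follows, assuming 1-indexed layers as in the text and assuming every tier must hold at least one layer; the function name and the layer labels are illustrative only.

```python
def partition_layers(layers, h, v):
    """Split an ordered list of layers into weak-side, aggregator-side, and server-side parts.

    Layers are 1-indexed in the text: 1..h, h+1..v, v+1..L.
    """
    num_layers = len(layers)
    if not (1 <= h < v < num_layers):
        raise ValueError("expected 1 <= h < v < L so every tier holds at least one layer")
    return layers[:h], layers[h:v], layers[v:]


# Illustrative labels for an AlexNet-like stack (5 conv layers, 3 fully connected layers).
layers = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc1", "fc2", "fc3"]
weak_side, aggregator_side, server_side = partition_layers(layers, h=2, v=5)
```

With h=2 and v=5, the client holds the first two convolution layers, the local aggregator holds the remaining three, and the server holds the fully connected layers.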

Novel Architectural Elements

Three-tier splitting architecture with flexible, jointly optimized split points (aggregator layer h and cut layer v).
Dynamic assignment of clients to local aggregators based on network and computation constraints.

Modeling

Base Model: AlexNet and VGG-11 (used in evaluation case studies)

Training Method: Accuracy-Aware Hierarchical Split Federated Learning with Local Loss (AA-HSFL-ll)

Objective Functions:

Purpose: Minimize training round delay subject to accuracy constraints.

Formally: min_{h, v, X} T_round(h, v, X) subject to v ∈ V* (the accuracy-aware set of cut layers) and the connectivity constraints.
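This summary does not reproduce the paper's expression for T_round. To illustrate why the assignment X affects the objective, the sketch below assumes a simplified synchronous delay model: each local aggregator waits for its slowest client, and the server waits for the slowest aggregator group. The function name, the dictionaries, and all numbers are hypothetical.

```python
def round_delay(assignment, client_time, aggregator_time):
    """Simplified synchronous round delay (an assumed model, not the paper's formula).

    assignment:      dict mapping client -> aggregator
    client_time:     dict mapping client -> seconds for client-side compute plus upload
    aggregator_time: dict mapping aggregator -> seconds for aggregator-side compute plus upload
    """
    group_delay = {}
    for client, agg in assignment.items():
        # Each aggregator waits for its slowest assigned client (straggler effect).
        group_delay[agg] = max(group_delay.get(agg, 0.0), client_time[client])
    # The server waits for the slowest aggregator group.
    return max(group_delay[a] + aggregator_time[a] for a in group_delay)


client_time = {"c1": 1.0, "c2": 4.0, "c3": 2.0}
aggregator_time = {"a1": 1.0, "a2": 3.0}
slow_on_fast = round_delay({"c1": "a1", "c2": "a1", "c3": "a2"}, client_time, aggregator_time)
slow_on_slow = round_delay({"c1": "a1", "c2": "a2", "c3": "a1"}, client_time, aggregator_time)
```

Under this toy model, placing the slowest client on the slowest aggregator lengthens the round, which is the kind of interaction a joint optimization over h, v, and X is meant to capture.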

Trainable Parameters: Full model split across three tiers

Training Data:

Public datasets. Specific dataset names are not stated in the available text; CIFAR-10 or ImageNet would be typical for AlexNet and VGG-11, but this is an inference, not a reported detail.

Key Hyperparameters:

lambda: Fraction of clients operating as local aggregators (between 0 and 1)
local_aggregation_frequency: Every epoch
global_aggregation_frequency: Every round (E epochs)
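The two aggregation frequencies above imply a simple schedule: local aggregation after every epoch, global aggregation after every E-th epoch. The sketch below shows that schedule together with a FedAvg-style weighted average. The averaging rule is an assumption (the summary does not state the paper's aggregation weights), and the function names are hypothetical.

```python
def weighted_average(models, weights):
    """FedAvg-style weighted average of equally sized parameter vectors (assumed rule)."""
    total = sum(weights)
    return [
        sum(w * m[i] for m, w in zip(models, weights)) / total
        for i in range(len(models[0]))
    ]


def aggregation_events(num_epochs, E):
    """List which aggregations fire after each epoch: local every epoch, global every E epochs."""
    events = []
    for epoch in range(1, num_epochs + 1):
        fired = ["local"]
        if epoch % E == 0:
            fired.append("global")
        events.append((epoch, fired))
    return events


schedule = aggregation_events(num_epochs=4, E=2)
merged = weighted_average([[1.0, 2.0], [3.0, 4.0]], weights=[1, 3])
```

Here `weights` would typically be the number of training samples behind each model, so larger clients contribute more to the merged parameters.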

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard SFL: Introduces 3-tier architecture and joint optimization of splits/assignments.
vs. HSFL: Explicitly models accuracy as a function of the cut layer (unlike HSFL which assumes invariance) and optimizes for it.
vs. Local-loss SFL: Integrates local loss into a hierarchical topology, addressing both backward locking and straggler effects simultaneously.

Limitations

Relies on the assumption that network topology and transmission rates remain stable during training rounds.
Finding the optimal solution is NP-hard; the proposed solution is a heuristic.
Requires an initial phase or proxy task to determine the set of high-accuracy cut layers (V*).
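To give a sense of why exhaustive search is impractical and a heuristic is used, the sketch below counts candidate solutions under two assumptions that are not taken from the paper: every client is assigned to exactly one local aggregator, and the split points satisfy 1 <= h < v < L. The function name is hypothetical.

```python
def exhaustive_search_size(num_layers, num_clients, num_aggregators):
    """Count candidate (h, v, assignment) solutions under the assumptions stated above."""
    # Pairs (h, v) with 1 <= h < v < num_layers.
    split_pairs = (num_layers - 1) * (num_layers - 2) // 2
    # Each client independently picks one aggregator.
    assignments = num_aggregators ** num_clients
    return split_pairs * assignments


small = exhaustive_search_size(num_layers=4, num_clients=2, num_aggregators=2)
large = exhaustive_search_size(num_layers=8, num_clients=20, num_aggregators=4)
```

The count grows exponentially in the number of clients, so even a modest network with 20 clients and 4 aggregators yields more than 10^13 candidates for an 8-layer model.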

Reproducibility

No code URL provided. Algorithms are described mathematically. The simulation models (AlexNet and VGG-11) are named, but the specific datasets (e.g., CIFAR vs ImageNet) are not identified in the available text.

📊 Experiments & Results

Evaluation Setup

Simulation of a wireless network with clients, aggregators, and a server.

Benchmarks:

AlexNet (Image Classification)
VGG-11 (Image Classification)

Metrics:

Test Accuracy (%)
Training Delay (time per round)
Communication Overhead (bytes)

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Public datasets (AlexNet/VGG-11)	Accuracy	Not reported in the paper	Not reported in the paper	+3%
Public datasets (AlexNet/VGG-11)	Training Delay	Not reported in the paper	Not reported in the paper	-20%
Public datasets (AlexNet/VGG-11)	Communication Overhead	Not reported in the paper	Not reported in the paper	-50%

Experiment Figures

Test accuracy of AlexNet and VGG-11 over training epochs for different cut layer selections.

Main Takeaways

Cut layer selection is not accuracy-invariant; optimizing it yields significant accuracy gains (e.g., layer 5 vs layer 2 in AlexNet).
Jointly optimizing assignments and splitting layers significantly outperforms fixed or delay-only optimization strategies.
The proposed heuristic achieves near-optimal results compared to exhaustive search but with much lower computational cost.
The architecture effectively mitigates both backward locking (via local loss) and straggler effects (via hierarchical aggregation).

📚 Prerequisite Knowledge

Prerequisites

Split Learning (SL) and Federated Learning (FL) concepts
Backpropagation and gradient descent dynamics
NP-hardness and combinatorial optimization

Key Terms

SFL: Split Federated Learning—hybrid approach where models are split between client and server (SL) and aggregated periodically (FL).

HSFL: Hierarchical Split Federated Learning—adds an intermediate tier of 'local aggregators' between clients and the central server.

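cut layer: The layer at which a neural network is split between the client-side model and the server-side model. In the three-tier setting here, cut layer v separates the aggregator-side model from the server-side model.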

local-loss: A technique where the client side computes gradients from an auxiliary loss function at the cut layer instead of waiting for the full backward pass from the server. In this architecture the local aggregator performs that computation.

backward locking effect: Idle time where clients wait for the server to finish forward/backward propagation before they can update their local model.

straggler effect: Delay caused by faster nodes waiting for the slowest node (straggler) to complete its task in synchronous distributed training.

aggregator layer: An additional split point in HSFL separating the weak-side model (on client) from the aggregator-side model (on local aggregator).

BP: BackPropagation—the algorithm for calculating gradients in neural networks.