Evaluation Setup
Image and text classification under pathological (label partition) and practical (Dirichlet distribution) non-IID settings.
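The practical non-IID setting can be illustrated with a minimal sketch of a Dirichlet label partition. This is a common construction, not necessarily the exact procedure used in the paper; the function name `dirichlet_partition` and its parameters are hypothetical, and details such as minimum per-client sample counts are omitted.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Split sample indices across clients using per-class Dirichlet(beta) shares.

    Illustrative sketch only: the paper's exact partitioning code may differ.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Share of this class that each client receives.
        proportions = rng.dirichlet(np.full(num_clients, beta))
        # Turn the shares into split points inside the shuffled index array.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices

# Toy labels (10 classes, 100 samples each), used only to exercise the function.
toy_labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(toy_labels, num_clients=20, beta=0.1, seed=1)
```

Smaller values of `beta` concentrate each class on fewer clients, so `beta=0.1` gives a strongly skewed split while larger values approach a uniform one.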
Benchmarks:
- MNIST (Image Classification)
- CIFAR-10 (Image Classification)
- CIFAR-100 (Image Classification)
- Tiny-ImageNet (Image Classification)
- AG News (Text Classification)
Metrics:
- Test Accuracy
- Statistical methodology: Report mean and standard deviation over 5 runs.
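The reporting convention above (mean and standard deviation over 5 runs) can be sketched as follows. The helper name `summarize_runs` is hypothetical, the accuracy values are placeholders rather than results from the paper, and the use of the sample standard deviation is an assumption, since the summary does not say which estimator was used.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run test accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Placeholder values for illustration only; these are NOT results from the paper.
placeholder_runs = [50.0, 51.0, 49.0, 50.5, 49.5]
mean_acc, std_acc = summarize_runs(placeholder_runs)
print(f"{mean_acc:.2f} ± {std_acc:.2f}")
```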
Key Results
Main comparison on practical non-IID settings (Dirichlet beta=0.1) shows FedCP consistently outperforming baselines, especially on harder tasks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CIFAR-100 | Test Accuracy | 52.87 | 59.56 | +6.69 |
| Tiny-ImageNet (ResNet-18) | Test Accuracy | 39.95 | 44.18 | +4.23 |
| AG News | Test Accuracy | 96.28 | 96.78 | +0.50 |

Robustness to client dropout (simulating unstable mobile networks) shows FedCP maintains performance while others degrade.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CIFAR-100 | Test Accuracy | 44.43 | 54.20 | +9.77 |

Scalability experiments varying client numbers show FedCP scaling better than baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CIFAR-100 | Test Accuracy | 30.24 | 35.87 | +5.63 |
Main Takeaways
- FedCP consistently outperforms state-of-the-art personalized federated learning (pFL) methods (FedRep, Ditto, FedRoD) across varying degrees of data heterogeneity (beta=0.01 to 1.0).
- The method is robust to client dropout: when participation fluctuates randomly, it reaches 54.20 test accuracy on CIFAR-100 versus 44.43 for the baseline, whereas regularization-based methods degrade.
- Ablation studies confirm that both the CPN (Conditional Policy Network) and the feature alignment (MMD loss) are critical; removing CPN causes a ~3% accuracy drop.
- Visualizations (Grad-CAM) confirm the dual heads specialize: the global head focuses on background/generic features (sky, grass), while the personalized head focuses on specific objects/colors.
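The feature alignment term mentioned in the ablation takeaway is an MMD (maximum mean discrepancy) loss. As a rough illustration of what such a term measures, here is a minimal sketch of squared MMD with a linear kernel. The kernel choice, the function name `linear_mmd2`, and which feature sets are compared are assumptions; this summary does not specify the form used in the paper.

```python
import numpy as np

def linear_mmd2(x, y):
    """Squared MMD with a linear kernel: ||mean(x) - mean(y)||^2.

    x and y are (num_samples, feature_dim) arrays of feature vectors.
    Illustrative sketch only; the paper's kernel and inputs may differ.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = x.mean(axis=0) - y.mean(axis=0)
    return float(diff @ diff)

# Toy feature batches, used only to exercise the function.
features_a = np.zeros((4, 3))
features_b = np.ones((5, 3))
gap_same = linear_mmd2(features_a, features_a)
gap_shifted = linear_mmd2(features_a, features_b)
```

The value is zero when the two batches have the same mean feature vector and grows as the means move apart, which is why minimizing it pulls the two feature distributions toward each other.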