SLM: Small Language Model—a model with relatively few parameters (e.g., 1.5B or 7B) deployed on client devices
LLM: Large Language Model—a model with many parameters (e.g., 70B) deployed on the server
Learnability Gap: The disparity between a teacher model's knowledge complexity and a student model's capacity to absorb it
DPO: Direct Preference Optimization—a method for aligning language models to preferences using a contrastive loss without a separate reward model
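The standard DPO objective makes the contrastive structure concrete: for a prompt x with a preferred response y_w and a dispreferred response y_l, the policy π_θ is pushed to widen its log-ratio margin over a frozen reference policy π_ref (σ is the logistic function, β a temperature):

```latex
\mathcal{L}_{\mathrm{DPO}} =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
-\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```

Because the reward is expressed implicitly through these log-ratios, no separately trained reward model is needed.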
UCB: Upper Confidence Bound—an algorithm used in multi-armed bandit problems to balance exploration and exploitation
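A minimal sketch of UCB arm selection (the classic UCB1 variant; the function name and constant `c` are illustrative, not from the source): each arm's score is its empirical mean reward plus a confidence bonus that shrinks as the arm is pulled more often, so rarely tried arms still get explored.

```python
import math

def ucb_select(counts, rewards, c=2.0):
    """Return the index of the arm with the highest UCB1 score.

    counts[i]  -- number of times arm i has been pulled
    rewards[i] -- total reward accumulated by arm i
    """
    # Pull every arm at least once before applying the formula
    # (avoids division by zero and seeds the estimates).
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    # Score = empirical mean + exploration bonus.
    scores = [
        rewards[i] / counts[i] + math.sqrt(c * math.log(total) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(counts)), key=lambda i: scores[i])
```

With `counts=[100, 1]` the second arm's bonus dominates even if its mean is lower, which is the exploration half of the trade-off; as its count grows, the bonus decays and exploitation takes over.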
ExploitNet: A neural network component of the filter that predicts sample rewards based on historical data
ExploreNet: A neural network component of the filter that estimates uncertainty to encourage exploration of new samples
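The two components above combine naturally in a UCB-style selection rule. The following is a hypothetical sketch, not the paper's implementation: the "networks" are stand-in linear scorers, and the names `exploit_score`, `explore_bonus`, `select_sample`, and the weight `alpha` are all illustrative assumptions.

```python
def exploit_score(features, w_exploit):
    # Stand-in for ExploitNet: predicted reward from historical data.
    return sum(f * w for f, w in zip(features, w_exploit))

def explore_bonus(features, w_explore):
    # Stand-in for ExploreNet: estimated uncertainty, kept non-negative.
    return abs(sum(f * w for f, w in zip(features, w_explore)))

def select_sample(samples, w_exploit, w_explore, alpha=1.0):
    """Pick the sample maximizing predicted reward + alpha * uncertainty,
    mirroring the UCB exploration/exploitation trade-off."""
    def score(s):
        return exploit_score(s, w_exploit) + alpha * explore_bonus(s, w_explore)
    return max(range(len(samples)), key=lambda i: score(samples[i]))
```

Raising `alpha` shifts selection toward uncertain (under-explored) samples; lowering it favors samples the filter already predicts to be rewarding.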
CoT: Chain-of-Thought—a prompting technique that encourages models to generate intermediate reasoning steps
SFT: Supervised Fine-Tuning—training a model on labeled input–output examples