pass rate: The probability p that the student model generates a correct solution for a given problem x, estimated via K rollouts
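A Monte Carlo sketch of this estimate, assuming hypothetical `generate` and `check` stand-ins for the student model and a correctness verifier (neither is defined in the source):

```python
import random

def estimate_pass_rate(problem, generate, check, k=16):
    """Estimate pass rate p as the fraction of k rollouts judged correct."""
    correct = sum(check(problem, generate(problem)) for _ in range(k))
    return correct / k

# Toy stand-ins: a "model" that answers correctly ~70% of the time.
random.seed(0)
generate = lambda x: 4 if random.random() < 0.7 else 5
check = lambda x, y: y == 4
p_hat = estimate_pass_rate("2+2", generate, check, k=1000)
```

With k=1000 rollouts the estimate concentrates near the true pass rate of 0.7; the k=16 default reflects the much smaller rollout budgets typical in practice.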
forward KL: Minimizing KL(P_Teacher || P_Student); forces the student to cover all modes of the teacher's distribution (preventing mode collapse but potentially including low-probability tails)
reverse KL: Minimizing KL(P_Student || P_Teacher); forces the student to focus on high-probability modes of the teacher (mode seeking/consolidation)
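The asymmetry between the two KL directions can be seen on a small discrete example (the distributions here are illustrative, not from the source):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.5, 0.3, 0.2]
student = [0.7, 0.2, 0.1]

forward = kl(teacher, student)  # KL(P_Teacher || P_Student): large wherever the
                                # student assigns little mass to a teacher mode
reverse = kl(student, teacher)  # KL(P_Student || P_Teacher): large wherever the
                                # student puts mass where the teacher has little
```

Driving the student's mass on a teacher mode toward zero blows up the forward KL (hence mode covering), while putting student mass where the teacher has none blows up the reverse KL (hence mode seeking).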
Beta kernel: A weight function w(p) = p^α(1-p)^β that peaks at intermediate pass rates and vanishes at 0 and 1, matching the theoretical SNR profile of distillation gradients
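The kernel itself is a one-liner; a minimal sketch (the default exponents here are illustrative):

```python
def beta_kernel(p, alpha=1.0, beta=1.0):
    """Weight w(p) = p^alpha * (1-p)^beta: zero at p=0 and p=1,
    peaking at the intermediate pass rate p = alpha / (alpha + beta)."""
    return (p ** alpha) * ((1.0 - p) ** beta)
```

Problems the student always fails (p=0) or always solves (p=1) receive zero weight, while problems at intermediate pass rates, where the gradient SNR is highest, dominate the training signal.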
zone of proximal development: The set of problems where the student is neither fully incompetent nor fully masterful, representing the most efficient training signal
minimax-robust: A guarantee that the worst-case efficiency loss is bounded even if the true SNR profile deviates from the assumed Beta model
gradient signal-to-noise ratio (SNR): The ratio of the squared norm of the expected gradient to the trace of the gradient covariance matrix; a measure of learning efficiency
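An empirical estimator of this ratio from a batch of sampled per-example gradients, as a sketch (the estimator and test data are illustrative, not from the source):

```python
import numpy as np

def gradient_snr(grads):
    """SNR = ||E[g]||^2 / tr(Cov(g)) from sampled gradient vectors.

    grads: array of shape (n_samples, dim), one gradient per row.
    tr(Cov) equals the sum of per-coordinate sample variances."""
    g = np.asarray(grads, dtype=float)
    mean = g.mean(axis=0)
    cov_trace = g.var(axis=0, ddof=1).sum()
    return float(mean @ mean) / cov_trace

rng = np.random.default_rng(0)
# Consistent gradients (large mean, small noise) -> high SNR, efficient learning.
strong = gradient_snr(rng.normal(1.0, 0.1, size=(100, 5)))
# Gradients that mostly cancel (zero mean, large noise) -> SNR near zero.
weak = gradient_snr(rng.normal(0.0, 1.0, size=(100, 5)))
```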
SFT: Supervised Fine-Tuning—standard training on labeled data (often hard labels)
MMLU: Massive Multitask Language Understanding—a benchmark measuring general knowledge across many subjects, used here to measure catastrophic forgetting
MATH-500: A benchmark dataset of 500 mathematics problems used to evaluate reasoning capability
AIME: American Invitational Mathematics Examination—a challenging math competition used as a benchmark