Entropy-Aware On-Policy Distillation of Language Models

📝 Paper Summary

Knowledge Distillation · On-Policy Learning · Reasoning

EOPD improves on-policy distillation by dynamically switching from reverse KL to forward KL when the teacher's entropy is high, preserving crucial diversity in complex reasoning tasks.

Core Problem

Standard on-policy distillation uses reverse KL divergence, a mode-seeking objective that forces the student to collapse onto a single path even when the teacher is uncertain.

Why it matters:

High-entropy tokens in reasoning tasks often represent valid alternative paths or necessary uncertainty; collapsing them degrades reasoning capability
Reverse KL provides unstable gradient signals when the teacher distribution is flat or multi-modal, preventing proper convergence
Current methods achieve efficiency but lose the distributional richness of the teacher, leading to worse performance on complex benchmarks

Concrete Example: In a toy experiment where the teacher has a high-entropy distribution spread across multiple modes, a student trained with reverse KL oscillates between predictions (frequent top-1 changes) and fails to cover the valid modes. Token-level entropy tells the same story: the teacher retains 18.5% high-entropy tokens, compared to the student's 6.8%.

Key Novelty

Entropy-Aware On-Policy Distillation (EOPD)

Dynamically switches the loss function based on the teacher's token-level entropy: uses standard reverse KL for confident low-entropy tokens to maintain efficiency
Activates forward KL for high-entropy tokens to force the student to cover the teacher's full distribution (mode-covering), preserving diversity where uncertainty exists
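The entropy test that drives the switch can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `token_entropy` and `high_entropy_mask` are hypothetical, and the threshold value `tau=1.0` is chosen only for the example, since the summary does not report the actual value of τ.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (in nats) of each row of a (tokens, vocab) array."""
    safe = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(probs * np.log(safe)).sum(axis=-1)

def high_entropy_mask(teacher_probs, tau):
    """True where the teacher's entropy exceeds tau (forward KL is activated)."""
    return token_entropy(teacher_probs) > tau

# Two teacher distributions over a 4-token toy vocabulary.
teacher_probs = np.array([
    [0.97, 0.01, 0.01, 0.01],  # confident: reverse KL only
    [0.25, 0.25, 0.25, 0.25],  # uncertain: forward KL activated
])
mask = high_entropy_mask(teacher_probs, tau=1.0)  # tau is an assumed value
fraction_high_entropy = mask.mean()
print(mask, fraction_high_entropy)
```

The same mask, averaged over a corpus of generated tokens, gives the kind of "share of high-entropy tokens" statistic quoted in the concrete example.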

Architecture

The training loop for Entropy-Aware On-Policy Distillation

Evaluation Highlights

+5.05 Pass@8 improvement on average across six math benchmarks for Qwen3-4B-Base compared to standard on-policy distillation
+2.39 Pass@8 improvement for Qwen3-1.7B-Base on the same benchmarks
Retains significantly more high-entropy tokens than baselines, successfully transferring the teacher's uncertainty structure

Breakthrough Assessment

7/10

Solid methodological improvement identifying a specific failure mode of reverse KL (diversity collapse) and proposing a principled, effective fix. Consistent gains across model sizes.

⚙️ Technical Details

Problem Definition

Setting: On-policy knowledge distillation where a student model learns to mimic a teacher model's distribution while generating its own trajectories.

Inputs: Prompt q from dataset D

Outputs: Generated token sequence x

Pipeline Flow

Student Generation: Student policy generates response to prompt
Teacher Evaluation: Teacher computes probabilities and entropy for generated context
Loss Calculation: Apply Reverse KL to every token and add Forward KL where teacher entropy exceeds the threshold τ
Update: Update student policy via gradient descent

System Modules

Student Policy

Generate response tokens autoregressively

Model or implementation: Qwen3-Base (0.6B, 1.7B, or 4B)

Teacher Model

Provide target probability distribution and entropy measurements

Model or implementation: Qwen3-8B

Loss Switch

Determine which divergence to optimize based on teacher entropy

Model or implementation: Heuristic (Threshold τ)

Novel Architectural Elements

Entropy-conditioned loss function: The objective changes per token based on teacher uncertainty, using Reverse KL alone for low-entropy tokens and adding a top-k restricted Forward KL term for high-entropy tokens

Modeling

Base Model: Qwen3-Base (various sizes)

Training Method: On-Policy Distillation with Entropy-Aware Objective

Objective Functions:

Purpose: Optimize student to match teacher.

Formally: L = L_OPD if H_teacher <= tau else L_OPD + beta * L_FKL

Purpose: Efficient mode-seeking learning (Base).

Formally: L_OPD = Clipped Reverse KL (PPO-style surrogate)

Purpose: Diversity-preserving mode-covering learning (Augmentation).

Formally: L_FKL = Sum(Teacher_prob * log(Teacher_prob / Student_prob)) computed over top-k tokens
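The three formulas above can be combined into a per-token loss roughly as sketched below. Several things here are assumptions rather than details from the paper: the function name `eopd_token_losses` is hypothetical; L_OPD is simplified to an exact reverse KL over the full vocabulary instead of the clipped PPO-style surrogate; the top-k set is taken from the teacher's most probable tokens without renormalization; and the values of `tau` and `beta` are illustrative. The sketch only evaluates the loss and does not compute gradients.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def eopd_token_losses(student_logits, teacher_logits, tau, beta, k):
    """Per-token L = L_OPD, plus beta * L_FKL where teacher entropy > tau."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    log_s = np.log(np.clip(p_s, 1e-12, 1.0))
    log_t = np.log(np.clip(p_t, 1e-12, 1.0))

    # Stand-in for L_OPD: exact reverse KL(student || teacher).
    # The paper uses a clipped PPO-style surrogate instead.
    l_opd = (p_s * (log_s - log_t)).sum(axis=-1)

    # L_FKL over the teacher's top-k tokens (assumed selection rule).
    top = np.argsort(-p_t, axis=-1, kind="stable")[:, :k]
    pt_k = np.take_along_axis(p_t, top, axis=-1)
    log_ratio = (np.take_along_axis(log_t, top, axis=-1)
                 - np.take_along_axis(log_s, top, axis=-1))
    l_fkl = (pt_k * log_ratio).sum(axis=-1)

    # Gate: the forward KL term is only added for high-entropy teacher tokens.
    h_teacher = -(p_t * log_t).sum(axis=-1)
    gate = h_teacher > tau
    return l_opd + beta * gate * l_fkl, gate

# Token 0: confident teacher, student agrees. Token 1: flat teacher, peaked student.
teacher_logits = np.array([[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]])
student_logits = np.array([[5.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]])
losses, gate = eopd_token_losses(student_logits, teacher_logits,
                                 tau=1.0, beta=1.0, k=2)
print(losses, gate)
```

In a real training setup the logits would come from the student's own rollout scored by both models, and `k` would be 16 to match the reported `top_k_for_fkl`.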

Adaptation: Full fine-tuning

Training Data:

MATH dataset (for 0.6B/1.7B students)
DAPO dataset (for 4B student)

Key Hyperparameters:

batch_size: 128
mini_batch_size: 32
gradient_steps_per_iteration: 4
rollout_temperature: 1.0
top_k_for_fkl: 16

Compute: Not reported in the paper

Comparison to Prior Work

vs. KD: EOPD generates data on-policy to avoid distribution mismatch, but brings back Forward KL selectively
vs. OPD: EOPD adds Forward KL term in high-entropy regions to prevent mode collapse, whereas OPD uses only Reverse KL
vs. GRPO: EOPD uses dense token-level signals from a teacher model rather than sparse outcome rewards

Limitations

Depends on a high-quality teacher model providing calibrated uncertainty estimates
Computing Forward KL requires top-k probabilities from the student, which adds slight overhead over pure sampling-based Reverse KL
Requires tuning of entropy threshold and mixing coefficient

Reproducibility

Code availability is not stated. Hyperparameters such as batch size and top-k are reported. Exact values for the entropy threshold τ and the weighting coefficient β are not listed in the main text.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks with generated chain-of-thought

Benchmarks:

MATH500 (Math problem solving)
AIME24/25 (Math competition problems)
AMC23 (Math competition problems)
Minerva (Math reasoning)
OlympiadBench (Olympiad-level math)

Metrics:

Avg@8 (Average accuracy over 8 samples)
Pass@8 (Probability at least one of 8 samples is correct)
Statistical methodology: Not explicitly reported in the paper
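The two metrics above can be computed from a table of per-sample correctness flags, as in this small sketch. The function names and the toy `results` data are made up for illustration; they are not results from the paper.

```python
def avg_at_k(correct):
    """Mean per-sample accuracy for each problem, averaged over problems."""
    return sum(sum(c) / len(c) for c in correct) / len(correct)

def pass_at_k(correct):
    """Fraction of problems with at least one correct sample among the k drawn."""
    return sum(any(c) for c in correct) / len(correct)

# Three hypothetical problems, 8 samples each (True = correct answer).
results = [
    [True] + [False] * 7,  # solved once in 8 tries
    [False] * 8,           # never solved
    [True] * 8,            # always solved
]
print(avg_at_k(results), pass_at_k(results))
```

The first problem shows why the two metrics diverge: a single correct sample counts fully toward Pass@8 but contributes only 1/8 to Avg@8, so Pass@8 rewards a model that keeps diverse solution paths alive.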

Key Results

EOPD consistently outperforms standard On-Policy Distillation (OPD) across different model scales on mathematical reasoning benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
Average of 6 Math Benchmarks	Pass@8	Not reported in the paper	Not reported in the paper	+1.37
Average of 6 Math Benchmarks	Pass@8	Not reported in the paper	Not reported in the paper	+2.39
Average of 6 Math Benchmarks	Pass@8	Not reported in the paper	Not reported in the paper	+5.05
Average of 6 Math Benchmarks	Avg@8	Not reported in the paper	Not reported in the paper	+1.16

Experiment Figures

Toy experiment results showing instability of Reverse KL under high teacher entropy

Histogram of token-level entropy for Teacher vs. Student (Standard OPD)

Main Takeaways

EOPD improves average Pass@8 over baseline OPD by up to +5.05 points, indicating better coverage of valid solutions (diversity)
Analysis of token entropy shows EOPD preserves more high-entropy tokens than standard OPD, matching the teacher's uncertainty profile better
Improvements scale with model size, with the largest gains observed in the 4B parameter model (+5.05 Pass@8)
Reverse KL is shown to be unstable in high-entropy settings (toy experiment), validating the need for the Forward KL augmentation

📚 Prerequisite Knowledge

Prerequisites

Knowledge Distillation (KD)
Forward vs. Reverse KL Divergence
On-Policy Learning / Policy Gradient
PPO (Proximal Policy Optimization)

Key Terms

Reverse KL: A divergence measure (KL(Student || Teacher)) that penalizes the student for generating samples unlikely under the teacher; it is mode-seeking and ignores teacher modes the student doesn't visit

Forward KL: A divergence measure (KL(Teacher || Student)) that penalizes the student for assigning low probability to samples likely under the teacher; it is mode-covering and forces the student to match the full teacher distribution

Mode-seeking: Behavior where a model focuses on a single high-probability peak of a distribution, ignoring others

Mode-covering: Behavior where a model spreads its probability mass to cover all high-probability peaks of the target distribution

Entropy: A measure of uncertainty or randomness in a distribution; high entropy means probability is spread across many tokens

On-Policy Distillation: Training a student model using samples generated by the student itself (corrected by the teacher), rather than fixed offline data

Pass@k: An evaluation metric that counts a problem as solved if at least one correct solution is found among k generated samples
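The asymmetry between the Forward KL and Reverse KL terms defined above can be seen in a small numeric comparison. The three distributions below are invented for illustration: a two-mode teacher, a student that has collapsed onto one mode, and a student that spreads mass over both.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for two strictly positive distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

teacher   = [0.49, 0.49, 0.02]  # two equally valid modes
collapsed = [0.96, 0.02, 0.02]  # student that kept only the first mode
covering  = [0.40, 0.40, 0.20]  # student that keeps both modes

for name, student in [("collapsed", collapsed), ("covering", covering)]:
    reverse = kl(student, teacher)  # KL(Student || Teacher), mode-seeking
    forward = kl(teacher, student)  # KL(Teacher || Student), mode-covering
    print(f"{name}: reverse={reverse:.3f} forward={forward:.3f}")
```

Forward KL penalizes the collapsed student far more heavily than the covering one, because the teacher's second mode receives almost no student probability. Reverse KL separates the two students much less sharply, which is consistent with the summary's point that it does little to discourage dropping a teacher mode.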