On-Policy Context Distillation for Language Models

📝 Paper Summary

Memory internalization, Knowledge distillation, On-policy learning

On-Policy Context Distillation (OPCD) enables language models to internalize context by training on their own generated trajectories while minimizing reverse KL divergence against a context-conditioned teacher.

Core Problem

Existing context distillation methods rely on off-policy training with forward KL divergence, which causes exposure bias (mismatch between teacher-forced training and student generation) and mode-covering behavior (hallucinations).

Why it matters:

In-context knowledge is transient and lost when the context resets, requiring costly re-processing of prompts or retrieved documents
Standard off-policy distillation suffers from exposure bias, where students fail to correct their own errors during inference
Forward KL minimization forces students to cover the teacher's entire distribution, leading to broad, disjointed outputs when the student lacks the teacher's capacity

Concrete Example: When distilling a math solution trace into a student, standard methods force the student to mimic the exact teacher tokens. If the student deviates slightly, it doesn't learn how to recover. OPCD lets the student generate its own solution attempts and corrects them based on the teacher's feedback.

Key Novelty

On-Policy Context Distillation (OPCD)

Trains the student model on its own generated trajectories (on-policy) rather than fixed teacher data, ensuring it learns to recover from its own states
Uses reverse KL divergence to encourage mode-seeking behavior, making the student focus on the teacher's high-likelihood regions rather than trying to cover the entire complex distribution
Applies this to 'Experiential Knowledge Distillation', where models solve problems, extract lessons, and then internalize those lessons permanently
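
The contrast between mode-covering (forward KL) and mode-seeking (reverse KL) can be illustrated with a small numeric sketch. The distributions below are made-up numbers chosen for illustration and do not come from the paper; `kl`, `teacher`, `peaked`, and `spread` are hypothetical names.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical teacher with two modes (outcomes 0 and 3) over four outcomes.
teacher = [0.48, 0.02, 0.02, 0.48]

# Two hypothetical limited-capacity students.
peaked = [0.94, 0.02, 0.02, 0.02]  # commits to one teacher mode
spread = [0.25, 0.25, 0.25, 0.25]  # spreads mass over every outcome

for name, student in [("peaked", peaked), ("spread", spread)]:
    forward = kl(teacher, student)  # KL(Teacher || Student)
    reverse = kl(student, teacher)  # KL(Student || Teacher)
    print(f"{name}: forward={forward:.3f} reverse={reverse:.3f}")
```

With these numbers, forward KL assigns the lower loss to the spread student, while reverse KL assigns the lower loss to the peaked student, which matches the mode-covering versus mode-seeking description above.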

Architecture

The training loop: the student generates a trajectory y from x, and the teacher scores [c; x; y] to provide supervision via reverse KL.

Evaluation Highlights

+10-15% accuracy gains on DAPO-Math-17K compared to standard context distillation baselines
Achieves ~2% higher accuracy on out-of-distribution benchmarks (IF-Eval) compared to off-policy baselines, indicating reduced catastrophic forgetting
Enables effective cross-size distillation where a 1.7B student successfully internalizes knowledge from an 8B teacher, whereas direct context injection fails

Breakthrough Assessment

8/10

Significantly improves upon standard context distillation by addressing two fundamental optimization issues: exposure bias and mode-covering. The application to 'experiential knowledge' (learning from self-generated traces) is a compelling path toward self-improving LLMs.

⚙️ Technical Details

Problem Definition

Setting: Distilling a specific context c into student parameters θ such that the student mimics a teacher π_teacher(·|c, x) without seeing c

Inputs: Input x (without context c)

Outputs: Generated response y that mimics the behavior of π_teacher(y|c, x)

Pipeline Flow

Student Generation (samples trajectory y given x)
Teacher Evaluation (computes probability of y given c + x)
Loss Computation (Reverse KL on student's generated tokens)
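
The three stages above can be sketched with toy tabular policies. This is a minimal illustration under stated assumptions, not the paper's implementation: `student_probs` and `teacher_probs` are hypothetical stand-ins that return fixed distributions, the vocabulary is invented, and no gradient update is performed.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]  # toy vocabulary (assumption)

def student_probs(x, prefix):
    # Hypothetical stand-in for pi_theta(. | x, y_<t): sees x only.
    return [0.5, 0.3, 0.2]

def teacher_probs(c, x, prefix):
    # Hypothetical stand-in for pi_teacher(. | c, x, y_<t): also sees context c.
    return [0.7, 0.1, 0.2]

def sample_trajectory(x, rng, max_len=8):
    """Stage 1: the student samples y token by token, given x only."""
    y = []
    for _ in range(max_len):
        token = rng.choices(VOCAB, weights=student_probs(x, y), k=1)[0]
        y.append(token)
        if token == "<eos>":
            break
    return y

def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) over the full toy vocabulary."""
    return sum(s * math.log(s / t) for s, t in zip(p_student, p_teacher) if s > 0)

def opcd_loss(c, x, y):
    """Stages 2 and 3: score each student-visited prefix with the teacher
    and sum the per-position reverse KL."""
    total = 0.0
    for t in range(len(y)):
        prefix = y[:t]
        total += reverse_kl(student_probs(x, prefix), teacher_probs(c, x, prefix))
    return total

rng = random.Random(0)
y = sample_trajectory("question", rng)
loss = opcd_loss("lesson text", "question", y)
print(y, round(loss, 4))
```

In a real setup the two stand-ins would be language-model forward passes, and the loss would be backpropagated through the student only.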

System Modules

Student Model

Generates response trajectory y based only on input x

Model or implementation: Qwen3 (1.7B, 4B, 8B) or Llama-3 (various sizes)

Teacher Model

Provides target probability distribution conditioned on context c

Model or implementation: Same as student (self-distillation) or larger model (teacher-student)

Novel Architectural Elements

Integration of on-policy sampling loop into the context distillation pipeline (typically off-policy)
Experiential Knowledge accumulation workflow: Solve -> Extract -> Accumulate -> Distill
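
The Solve -> Extract -> Accumulate workflow can be sketched as a simple loop, with the resulting pool rendered into a context c for the Distill step. Every name here (`build_knowledge_pool`, `make_context`, the `solve` and `extract` callables) is hypothetical, and the de-duplication rule is an assumption rather than something the summary specifies.

```python
from typing import Callable, List

def build_knowledge_pool(
    problems: List[str],
    solve: Callable[[str], str],
    extract: Callable[[str, str], str],
) -> List[str]:
    """Collect lessons from solution traces (hypothetical helper)."""
    pool: List[str] = []
    for problem in problems:
        trace = solve(problem)              # Solve: produce a solution trace
        lesson = extract(problem, trace)    # Extract: distill a lesson from it
        if lesson and lesson not in pool:   # Accumulate: dedup is an assumption
            pool.append(lesson)
    return pool

def make_context(pool: List[str]) -> str:
    """Render the pool as the context c that the teacher will see."""
    return "\n".join(f"- {lesson}" for lesson in pool)

# Toy stand-ins for model calls (assumptions, not real APIs).
def toy_solve(problem: str) -> str:
    return f"trace for {problem}"

def toy_extract(problem: str, trace: str) -> str:
    return "Track units at every step." if "unit" in problem else ""

pool = build_knowledge_pool(
    ["unit conversion", "geometry", "unit rates"], toy_solve, toy_extract
)
context = make_context(pool)
print(context)
```

The Distill step would then pass `context` as c to the teacher during OPCD training, which is not shown here.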

Modeling

Base Model: Qwen3 (1.7B, 4B, 8B), Qwen2.5 (3B, 7B), Llama-3.1/3.2

Training Method: On-Policy Context Distillation (OPCD)

Objective Functions:

Purpose: Align student generation with teacher distribution using reverse KL.

Formally: L(θ) = E_{x~D, y~π_θ(·|x)} [ sum_t KL_reverse( π_θ(·|x, y_<t) || π_teacher(·|c, x, y_<t) ) ]

Purpose: Approximate the analytic KL divergence efficiently.

Formally: Summation restricted to Top-k tokens predicted by the student
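
A minimal sketch of the top-k restriction at a single token position follows. It assumes the truncated sum is taken without renormalizing the probabilities, which the summary does not specify; `topk_reverse_kl` and the example distributions are hypothetical.

```python
import math

def topk_reverse_kl(p_student, p_teacher, k):
    """Sum s * log(s / t) over the k tokens the student rates most likely.

    Assumption: probabilities are used as-is, with no renormalization
    over the selected tokens.
    """
    top = sorted(range(len(p_student)), key=lambda i: p_student[i], reverse=True)[:k]
    return sum(
        p_student[i] * math.log(p_student[i] / p_teacher[i])
        for i in top
        if p_student[i] > 0
    )

# Made-up next-token distributions over a five-token vocabulary.
student = [0.60, 0.25, 0.10, 0.04, 0.01]
teacher = [0.70, 0.15, 0.05, 0.05, 0.05]

full = topk_reverse_kl(student, teacher, k=len(student))  # exact reverse KL
approx = topk_reverse_kl(student, teacher, k=3)           # top-3 approximation
print(f"full={full:.4f} approx={approx:.4f}")
```

Unlike the exact KL, the truncated sum is not guaranteed to be non-negative (with these numbers, k=1 gives a negative value), so an implementation may need to account for that.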

Training Data:

DAPO-Math-17K (English math problems)
TextArena (Frozen Lake, Sokoban) game traces
MedMCQA (Medical system prompts)
Tweet Eval/Hatecheck/Ethos (Safety system prompts)
Experiential knowledge pool: 300 contexts accumulated from validation splits

Key Hyperparameters:

batch_size: 128
training_steps: 50
max_response_length_math: 16384 tokens
max_response_length_games: 1024 tokens (per round, up to 5 rounds)
top_k_approximation: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Context Distillation (Standard): OPCD uses on-policy generation + reverse KL to fix exposure bias and hallucinations
vs. Generalized KD [not cited in paper]: Applies specifically to 'internalizing' a transient context c, rather than general teacher mimicry
vs. Direct Context Injection: OPCD compresses knowledge into weights, avoiding inference latency and context window limits; outperforms injection for small students

Limitations

Computational overhead of on-policy sampling during training (student must generate full trajectories)
Requires a capable teacher model (or self-model) that can effectively utilize the context
Experiential knowledge extraction relies on the model's ability to self-reflect, which may be limited in smaller models

Reproducibility

Code availability is not provided. Datasets (DAPO-Math, TextArena, MedMCQA) are public. Detailed prompt templates for experiential knowledge extraction/accumulation are in Appendix A.1.

📊 Experiments & Results

Evaluation Setup

Distilling experiential knowledge (Math, Games) and system prompts (Medical, Safety) into student models.

Benchmarks:

DAPO-Math-17K (Mathematical reasoning)
TextArena (Frozen Lake, Sokoban) (Text-based reinforcement learning / reasoning games)
MedMCQA / Safety Benchmarks (Domain-specific QA and Safety classification)
IF-Eval (Instruction following; used for OOD evaluation)

Metrics:

Task Accuracy
Strict Accuracy (IF-Eval for OOD)
Prompt-level Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Experiential Knowledge Distillation results showing OPCD outperforms standard context distillation and often even the 'In-Context' baseline (where knowledge is in the prompt).

Benchmark	Metric	Baseline	This Paper	Δ
DAPO-Math-17K	Accuracy	45.0	51.1	+6.1
IF-Eval (Frozen Lake OOD)	Strict Accuracy	See Table 2	See Table 2	+2.0
Medical Test (OOD for Safety Task)	Accuracy	See Figure 3	See Figure 3	+4.0

Main Takeaways

OPCD consistently outperforms standard off-policy context distillation across Math, Games, and System Prompt tasks.
On-policy training significantly reduces catastrophic forgetting, maintaining higher OOD performance (e.g., on IF-Eval) compared to off-policy baselines.
Teacher-Student distillation is more stable and effective than Self-Distillation, especially for experiential knowledge.
Directly injecting experiential knowledge into a small student's context can degrade performance; distilling it via OPCD is superior.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Distillation (Teacher-Student frameworks)
Kullback-Leibler (KL) Divergence (Forward vs. Reverse)
In-Context Learning
On-Policy vs. Off-Policy Learning

Key Terms

OPCD: On-Policy Context Distillation, the proposed method that trains a student on its own generations to minimize reverse KL against a context-conditioned teacher

Reverse KL Divergence: An objective function (KL(Student || Teacher)) that penalizes the student for generating samples unlikely under the teacher, encouraging mode-seeking behavior

Forward KL Divergence: The standard objective (KL(Teacher || Student)) used in most distillation, which penalizes the student for missing parts of the teacher's distribution, often causing mode-covering (broad/blurry) outputs

Exposure Bias: The error accumulation that occurs when a model is trained on ground-truth/teacher trajectories but generates its own tokens autoregressively at test time

Experiential Knowledge Distillation: A process where a model solves problems, extracts lessons (experiences) from its traces, accumulates them into a context, and then distills that context into its weights

System Prompt Distillation: Compressing the behavioral instructions of a system prompt (e.g., 'You are a medical expert') into the model's weights so the prompt isn't needed at inference

Mode-seeking: A behavior where a model focuses on the most likely output (peak) of the target distribution rather than trying to cover all possibilities

Top-k approximation: Approximating the sum over the entire vocabulary by summing only the top-k most probable tokens to make KL calculation computationally feasible