SFT: Supervised Fine-Tuning—training a model to maximize the likelihood of ground-truth responses (equivalent to minimizing Forward KL)
RL: Reinforcement Learning—training a model to maximize a reward signal, often using on-policy generations (equivalent to minimizing Reverse KL)
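To ground the SFT entry, here is a minimal toy sketch (hypothetical 4-token vocabulary and hand-written model probabilities, not from the original text) of the SFT objective: maximizing the likelihood of a ground-truth response is the same as minimizing its summed negative log-probability under the model.

```python
import math

# Toy SFT objective: maximize likelihood of a ground-truth token sequence
# == minimize the summed negative log-probability the model assigns to it.
vocab = ["yes", "no", "maybe", "<eos>"]
model_probs = [  # p_model(token_t | prefix), one row per position (hand-picked)
    {"yes": 0.7, "no": 0.1, "maybe": 0.1, "<eos>": 0.1},
    {"yes": 0.05, "no": 0.05, "maybe": 0.1, "<eos>": 0.8},
]
target = ["yes", "<eos>"]  # the ground-truth response

sft_loss = -sum(math.log(model_probs[t][tok]) for t, tok in enumerate(target))
# sft_loss = -(log 0.7 + log 0.8); training lowers it by raising these probabilities
```

Averaged over a dataset, this loss is the cross-entropy between the data distribution and the model, which differs from the forward KL only by the (constant) entropy of the data.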
Forward KL: Kullback-Leibler divergence direction KL(P_target || P_model), which forces the model to cover the entire target distribution (mode-covering)
Reverse KL: Kullback-Leibler divergence direction KL(P_model || P_target), which allows the model to focus on the highest probability regions of the target (mode-seeking)
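The mode-covering/mode-seeking contrast between the two KL directions can be checked numerically. The sketch below (toy bimodal Gaussian-mixture target and two hand-picked candidate models, chosen here for illustration) evaluates both KL directions on a grid: forward KL prefers the broad candidate that covers both modes, while reverse KL prefers the narrow candidate that sits on a single mode.

```python
import numpy as np

def gauss(x, mu, sigma):
    """Gaussian density N(mu, sigma^2) evaluated on array x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-12, 12, 4001)
dx = x[1] - x[0]
eps = 1e-300  # guard against log(0) in near-zero tails

# Bimodal target: equal mixture of two narrow Gaussians at -3 and +3.
p = 0.5 * gauss(x, -3, 0.5) + 0.5 * gauss(x, 3, 0.5)
q_seek = gauss(x, 3, 0.5)    # unimodal model sitting on one mode
q_cover = gauss(x, 0, 3.0)   # broad model stretched over both modes

def kl(a, b):
    """Numerical KL(a || b) via Riemann sum over the grid."""
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

# Forward KL(P || Q): penalizes Q for missing mass where P has mass,
# so the covering model scores better.
# Reverse KL(Q || P): penalizes Q for putting mass where P is near zero
# (the valley between modes), so the mode-seeking model scores better.
```

This is the usual explanation for why SFT-style (forward-KL) training spreads probability across all demonstrated behaviors, while RL-style (reverse-KL) training sharpens onto a subset of high-reward behaviors.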
On-policy data: Training data generated by the model currently being trained (used in RL), as opposed to fixed external data
GRPO: Group Relative Policy Optimization—an RL algorithm used for tasks with verifiable outputs that normalizes rewards within a group of generations
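The group-wise reward normalization named in the GRPO entry can be sketched in a few lines. This is a minimal illustration of the normalization step only (the surrounding policy-gradient update is omitted), assuming a group of generations sampled for a single prompt and scored by a 0/1 verifier.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of generations for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps).
    Generations above the group mean get positive advantage, below it negative."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 sampled answers to one prompt; a verifier marks two of them correct
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# advantages sum to ~0 within the group: correct answers are pushed up
# exactly in proportion to how much the group's incorrect answers are pushed down
```

Normalizing within the group removes the need for a learned value baseline: the other generations for the same prompt serve as the baseline.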
Catastrophic Forgetting: The tendency of neural networks to lose previously learned information upon learning new information
Mode-seeking: A property of a distributional approximation where it concentrates on one or a few peaks (modes) of the target, ignoring others
Mode-covering: A property where the approximation stretches to cover the entire support of the target distribution, often averaging across modes
Self-SFT: A baseline method where the model is fine-tuned on its own correct generations, sampled once from the initial policy (off-policy, since the data is not refreshed as training updates the policy)
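The Self-SFT data-collection loop can be sketched as follows. The `sample` and `verify` callables here are hypothetical stand-ins (a sampler for the frozen initial policy and a correctness verifier); the point is that the dataset is built once from the initial policy and then used for ordinary SFT, which is what makes the method off-policy.

```python
def collect_self_sft_data(prompts, sample, verify):
    """Build a fixed SFT dataset from the *initial* policy's own generations,
    keeping only those a verifier marks correct. The policy is never updated
    during collection, so later fine-tuning on `data` is off-policy."""
    data = []
    for prompt in prompts:
        for completion in sample(prompt, n=4):  # frozen initial policy
            if verify(prompt, completion):       # keep only correct answers
                data.append((prompt, completion))
    return data  # then: standard SFT on this fixed dataset

# toy usage with stub sampler/verifier
pairs = collect_self_sft_data(
    ["2+2=?"],
    sample=lambda p, n: ["4", "5", "4", "22"],
    verify=lambda p, c: c == "4",
)
```

Contrast with on-policy RL, where each training step draws fresh generations from the current (updated) policy instead of reusing this fixed set.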