RL's Razor: Why Online Reinforcement Learning Forgets Less

📝 Paper Summary

Catastrophic forgetting | Post-training optimization

Reinforcement learning inherently forgets less than supervised fine-tuning because its on-policy nature biases updates toward solutions with minimal KL divergence from the original model.

Core Problem

Fine-tuning foundation models on new tasks causes catastrophic forgetting, where performance on previously learned capabilities degrades significantly.

Why it matters:

Models deployed as long-term agents must continually adapt to new needs without losing prior knowledge
Current mitigation strategies address symptoms (e.g., weight constraints) rather than the underlying cause of forgetting
SFT erases prior knowledge even when it reaches new-task performance similar to RL

Concrete Example: When fine-tuning a Qwen 2.5 3B model on math reasoning, SFT improvements on the new task cause a sharp reduction in prior-task performance (e.g., MMLU, HumanEval), whereas RL improves the new task while keeping prior benchmarks nearly unchanged.

Key Novelty

RL's Razor / Empirical Forgetting Law

Identifies that the degree of forgetting is accurately predicted by the KL divergence between the fine-tuned and base policy on the new task alone
Explains RL's advantage as an implicit bias: on-policy sampling restricts updates to high-probability regions of the base model, naturally finding KL-minimal solutions among equally valid ones
Proposes that forgetting is governed by the solution distribution found, not by the optimization algorithm itself. This is demonstrated by an 'oracle SFT' that minimizes KL and retains more prior knowledge than RL
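As a rough illustration of the quantity in the first point above, the sketch below averages a per-input KL divergence over a batch of new-task inputs for a policy with a finite label set. The function name `mean_kl`, the array layout, and the direction KL(base ‖ fine-tuned) used in the toy usage are assumptions made for illustration; this summary does not specify them.

```python
import numpy as np

def mean_kl(p_probs, q_probs, eps=1e-12):
    """Average KL(p || q) over a batch of inputs.

    p_probs, q_probs: arrays of shape (n_inputs, n_labels) whose rows
    are probability distributions over the same label set.
    """
    p = np.clip(np.asarray(p_probs, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q_probs, dtype=float), eps, 1.0)
    # Per-input KL: sum over labels of p * (log p - log q).
    per_input = np.sum(p * (np.log(p) - np.log(q)), axis=1)
    return float(per_input.mean())

# Toy usage (hypothetical numbers): two new-task inputs, two labels.
base = np.array([[0.5, 0.5], [0.8, 0.2]])   # base policy on new-task inputs
tuned = np.array([[0.9, 0.1], [0.8, 0.2]])  # fine-tuned policy, same inputs
kl_new_task = mean_kl(base, tuned)
```

For an LLM the "label set" is the space of whole responses, so the exact sum would typically be replaced by an estimate over sampled responses; that variant is not shown here.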

Architecture

Conceptual illustration of 'RL's Razor' and the performance trade-off. Left: Solution space showing RL finding a solution closer to the initialization (Base Policy) compared to SFT. Right: Pareto frontiers showing RL maintaining higher Previous Task Performance for a given New Task Performance.

Evaluation Highlights

On Math reasoning tasks with Qwen 2.5 3B, RL achieves high new-task accuracy with minimal degradation on prior tasks, while SFT shows a steep drop in prior performance for the same gains
In a controlled ParityMNIST setting, KL divergence predicts forgetting with R²=0.96 across both RL and SFT methods
Oracle SFT (trained on analytically KL-minimal labels) retains more prior knowledge than standard RL, confirming KL minimization is the mechanism behind reduced forgetting

Breakthrough Assessment

9/10

Establishes a fundamental empirical law connecting forgetting to KL divergence on the new task, offering a unifying explanation for why RL outperforms SFT in continual learning.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning a pre-trained policy π₀ on a new task distribution τ to obtain π

Inputs: New task prompts/inputs x ∼ τ

Outputs: Model responses/actions y

Pipeline Flow

Input (New Task Prompts)
Policy Model (LLM or Robot Policy)
Output (Response/Action)

System Modules

Policy Model

Generate responses or actions for the new task

Model or implementation: Qwen 2.5 3B-Instruct (LLM) or OpenVLA 7B (Robotics)

Modeling

Base Model: Qwen 2.5 3B-Instruct (LLM), OpenVLA 7B (Robotics), 3-layer MLP (ParityMNIST)

Training Method: Reinforcement Learning (GRPO) vs. Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: SFT minimizes negative log-likelihood of fixed target labels.

Formally: Standard cross-entropy loss.

Purpose: RL maximizes reward (binary success) using on-policy samples.

Formally: GRPO objective (without explicit KL regularization in experiments).
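The two objectives above can be contrasted in a few lines. This is a minimal NumPy sketch under stated assumptions, not the paper's training code: `sft_nll` is the usual cross-entropy on fixed labels, and `group_relative_advantages` shows the within-group reward standardization that GRPO is generally described as using. The function names, array shapes, and the `eps` constant are illustrative.

```python
import numpy as np

def sft_nll(log_probs, target_ids):
    """Mean negative log-likelihood of fixed target labels.

    log_probs: (n, n_labels) array of model log-probabilities.
    target_ids: (n,) array of integer target labels.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    target_ids = np.asarray(target_ids)
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return float(-picked.mean())

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled responses.

    Samples that beat the group mean get a positive advantage, the rest
    a negative one; a group with identical rewards yields all zeros.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy usage: binary success rewards for four on-policy samples of one prompt.
adv = group_relative_advantages([1, 0, 0, 1])
loss = sft_nll(np.log([[0.5, 0.5]]), [0])
```

The key contrast is where the training signal comes from: SFT pushes probability toward externally fixed labels, while the RL update only reweights responses the current policy actually sampled.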

Adaptation: Full fine-tuning (implied by context of foundation models)

Training Data:

Math: Open-Reasoner-Zero dataset
Science: SciKnowEval (Chemistry L-3)
Tool Use: ToolAlpaca
Robotics: SimplerEnv (pick can task)

Compute: Not reported in the paper

Comparison to Prior Work

vs. EWC/LwF: This paper does not propose a new regularizer but identifies KL on the *new* task as the predictor of forgetting, explaining why RL naturally forgets less without explicit constraints
vs. Lai et al. (2025): Contradicts their claim that RL's advantage comes from negative examples; shows instead that on-policy sampling leads to KL-minimal solutions

Limitations

Experiments primarily rely on the correlation between KL and forgetting; the full causal mechanism is analyzed theoretically only in simplified settings
RL training on large LLMs is computationally expensive compared to SFT
Extremely large-scale models (>70B parameters), where forgetting dynamics might differ, were not evaluated

Reproducibility

Code availability is not explicitly provided in the paper text (website link is to a blog post). Datasets (Open-Reasoner-Zero, SciKnowEval, ToolAlpaca, SimplerEnv) and base models (Qwen 2.5, OpenVLA) are public.

📊 Experiments & Results

Evaluation Setup

Fine-tuning on a single new task and evaluating on a suite of prior benchmarks to measure forgetting

Benchmarks:

New Task Benchmarks (Domain-specific performance)
Prior Task Benchmarks (general capabilities, e.g., HellaSwag, MMLU)

Metrics:

New Task Accuracy
Previous Tasks Performance (aggregated score)
KL Divergence (on new task)
Statistical methodology: Pareto frontier analysis over diverse hyperparameters
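A Pareto frontier over hyperparameter runs can be extracted as sketched below. This is a generic illustration, assuming each run is summarized as a `(new_task_accuracy, prior_task_score)` pair with higher being better on both axes; the function name and the example numbers are hypothetical and do not come from the paper.

```python
def pareto_frontier(points):
    """Return the runs not dominated by any other run.

    points: iterable of (new_task_accuracy, prior_task_score) pairs,
    where higher is better on both axes. A run is dominated if another
    run is at least as good on both axes and differs from it.
    """
    # Sort by new-task accuracy (descending), ties by prior score (descending).
    ordered = sorted(set(points), key=lambda p: (-p[0], -p[1]))
    frontier = []
    best_prior = float("-inf")
    for new_acc, prior in ordered:
        # Keep a run only if it beats every run with >= new-task accuracy
        # on the prior-task axis.
        if prior > best_prior:
            frontier.append((new_acc, prior))
            best_prior = prior
    return frontier

# Hypothetical runs from a hyperparameter sweep.
runs = [(0.9, 0.5), (0.8, 0.7), (0.7, 0.6), (0.6, 0.9)]
front = pareto_frontier(runs)
```

Comparing methods by their frontiers rather than by single runs avoids crediting one method for a lucky hyperparameter choice.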

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ParityMNIST (Toy Setting)	R² (Forgetting vs. KL)	Not applicable	0.96	Not applicable
LLM Tasks (Combined)	R² (Forgetting vs. KL)	Not applicable	0.71	Not applicable

ParityMNIST experiments establish the strong correlation between KL divergence and forgetting in a controlled setting. LLM experiments demonstrate that KL predicts forgetting in large-scale settings as well.
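An R² value like those reported above comes from fitting a curve to (KL, forgetting) pairs and measuring the explained variance. The sketch below shows one way to do this with a polynomial fit. The functional form is not given in this summary, so the `degree` parameter, the function name, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def r_squared_of_fit(kl, forgetting, degree=2):
    """R² of a least-squares polynomial fit of forgetting against KL."""
    kl = np.asarray(kl, dtype=float)
    forgetting = np.asarray(forgetting, dtype=float)
    coeffs = np.polyfit(kl, forgetting, degree)
    predicted = np.polyval(coeffs, kl)
    ss_res = np.sum((forgetting - predicted) ** 2)          # unexplained variance
    ss_tot = np.sum((forgetting - forgetting.mean()) ** 2)  # total variance
    return float(1.0 - ss_res / ss_tot)

# Synthetic data (not from the paper): forgetting is an exact quadratic in KL.
kl_values = np.linspace(0.0, 2.0, 20)
forgetting_values = 0.1 * kl_values**2 + 0.05 * kl_values
r2_quadratic = r_squared_of_fit(kl_values, forgetting_values, degree=2)
r2_linear = r_squared_of_fit(kl_values, forgetting_values, degree=1)
```

Pooling runs from both RL and SFT into one fit is what lets a single R² speak to whether KL predicts forgetting independently of the training method.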

Experiment Figures

Pareto frontiers of New Task Performance vs. Previous Tasks Performance for LLM (Math, Science, Tool Use) and Robotics tasks.

ParityMNIST results: (Left) Performance trade-off for SFT, RL, and Oracle SFT. (Middle) Forgetting plotted against KL divergence showing a unified curve.

Main Takeaways

RL consistently retains more prior knowledge than SFT for the same level of new-task performance across Math, Science, Tool Use, and Robotics domains
The degree of catastrophic forgetting is well predicted by the KL divergence between the fine-tuned and base policy evaluated *only* on the new task (R²=0.96 on ParityMNIST, R²=0.71 across the LLM tasks)
SFT can converge to solutions arbitrarily far from the base model (high KL) depending on labels, whereas on-policy RL is implicitly biased toward solutions close to the base model (low KL)
An 'Oracle SFT' trained on labels that minimize KL divergence retains more prior knowledge than standard RL, indicating that the solution distribution found (KL-minimal), not the algorithm, is the key factor
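To make "KL-minimal among equally valid solutions" concrete, the sketch below handles a toy case in which several labels count as correct for one input. Restricting the base distribution to the correct labels and renormalizing gives the distribution q that minimizes KL(q ‖ base) among all distributions placing mass only on correct labels. This is a standard result used here for illustration; the task setup, the function name, the KL direction, and the numbers are assumptions, and the paper's actual oracle label construction is not described in this summary.

```python
import numpy as np

def restrict_to_valid(base_probs, valid_mask):
    """Closest fully-correct distribution to the base policy.

    base_probs: (n_labels,) base-policy probabilities for one input.
    valid_mask: (n_labels,) booleans marking labels that count as correct.
    Returns q with q[label] = base[label] / (base mass on valid labels)
    for valid labels and 0 elsewhere, which minimizes KL(q || base)
    subject to q being supported only on valid labels.
    """
    p = np.asarray(base_probs, dtype=float)
    mask = np.asarray(valid_mask, dtype=bool)
    valid_mass = p[mask].sum()
    if valid_mass <= 0.0:
        raise ValueError("base policy puts no mass on any valid label")
    return np.where(mask, p, 0.0) / valid_mass

# Hypothetical example: ten labels, of which the even-indexed ones are correct.
base = np.full(10, 0.1)
even = np.arange(10) % 2 == 0
oracle = restrict_to_valid(base, even)
```

An SFT run that uses samples from such a distribution as targets keeps the relative preferences of the base policy among correct answers, whereas arbitrary correct labels can pull the model far from its starting point.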

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (policy gradients, on-policy vs. off-policy)
Supervised Fine-Tuning (SFT)
KL Divergence
Catastrophic Forgetting

Key Terms

SFT: Supervised Fine-Tuning—training a model on a fixed dataset of input-output pairs

RL: Reinforcement Learning—training a model to maximize a reward signal, often by exploring and refining its own generations

KL divergence: Kullback-Leibler divergence, an asymmetric statistical measure of how one probability distribution differs from a second, reference distribution

On-policy: Learning from data generated by the current version of the model itself during training

Pareto frontier: The set of optimal trade-off points where improving one metric (e.g., new task accuracy) is impossible without degrading another (e.g., prior task performance)

Oracle SFT: An SFT baseline, used in the controlled ParityMNIST setting, trained on analytically derived labels that minimize KL divergence to the base model while maximizing accuracy

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used for fine-tuning LLMs