A Quantitative Characterization of Forgetting in Post-Training

📝 Paper Summary

Continual Learning Generative Model Post-Training

Theoretical analysis using mixture models proves that forward-KL objectives (SFT) inherently cause mass collapse of old knowledge, whereas reverse-KL (RL) preserves it, with forgetting limited only by distribution overlap.

Core Problem

Post-training procedures like SFT and RL induce catastrophic forgetting, but the specific mechanisms (weight collapse vs. parameter drift) are not theoretically quantified or distinguished.

Why it matters:

Practitioners use SFT and RL interchangeably for fine-tuning without understanding why SFT destroys old capabilities while RL might preserve them
Existing definitions of forgetting conflate 'ignoring the old task' (mass collapse) with 'corrupting the old task' (drift), hindering the design of targeted remedies

Concrete Example: When fine-tuning a model on new data only, an SFT objective (Forward-KL) forces the model to assign zero probability to old data regions ($eta o 0$) because the objective never sees them, even if the model is capable of representing both perfectly.

Key Novelty

Two-Mode Mixture Theory of Forgetting

Models the learner as a two-component mixture (Old vs. New) to analytically separate 'Mass Forgetting' (assigning zero weight to old tasks) from 'Old-Component Drift' (distorting old parameters)
Proves that Forward-KL (SFT) on new data inherently minimizes to zero old-task weight, while Reverse-KL (RL) updates preserve the weight and only drift due to distribution overlap (misassignment probability)

Breakthrough Assessment

8/10

Provides a rigorous theoretical foundation for a widely observed phenomenon (SFT forgets more than RL). The decomposition into mass collapse vs. drift is a valuable conceptual tool.

⚙️ Technical Details

Problem Definition

Setting: Continual learning of a new distribution $p_n$ while retaining an old distribution $p_o$, modeled as a mixture estimation problem

Inputs: Old distribution $p_o$, New distribution $p_n$, Target mixture $p_{\alpha} = \alpha p_o + (1-\alpha) p_n$

Outputs: Learned model $q_{\beta} = \beta q_o + (1-\beta) q_n$

Pipeline Flow

Theoretical Abstraction (not a software pipeline)
Old Component (Learned previously)
New Component (To be learned)
Mixture Weight (Optimization variable)

System Modules

Old Component ($q_o$) (Mixture Model)

Represents the previously learned behavior/distribution

Model or implementation: Gaussian Distribution $\mathcal{N}(\mu_o, \Sigma)$

New Component ($q_n$) (Mixture Model)

Represents the new behavior to be acquired

Model or implementation: Gaussian Distribution $\mathcal{N}(m_n, \Sigma)$

Mixture Weight ($eta$) (Mixture Model)

Controls the probability mass allocated to the old behavior versus the new behavior

Model or implementation: Scalar parameter $\in [0, 1]$

Modeling

Base Model: Two-component Gaussian Mixture Model (Theoretical abstraction)

Training Method: Analysis of Gradient Flow on KL Objectives

Objective Functions:

Purpose: SFT / Maximum Likelihood on new data.

Formally: Forward-KL $\min KL(p_{new} || q_{\beta})$
Purpose: RL / On-policy matching to target.

Formally: Reverse-KL $\min KL(q_{\beta} || p_{target})$

Key Hyperparameters:

delta: Separation between old and new mode means $||\mu_n - \mu_o||_{\Sigma^{-1}}$
alpha: Target mixture weight for the old component

Compute: Not applicable (Theoretical paper)

Comparison to Prior Work

vs. SDFT: SDFT behaves like reverse-KL with an evolving teacher; avoids drift if demonstrator is strong
vs. TTT-Discover: TTT is intrinsically mode-seeking and can collapse mass without a strong KL anchor
vs. OAPL: Uses exponential tilt of a frozen reference; updates are geometrically local with overlap-controlled influence

Limitations

Analysis relies on a simplified two-mode Gaussian mixture abstraction rather than full neural network dynamics
Assumes the model family is expressive enough to represent the target (real networks have capacity constraints)
Focuses on population-level objectives, though finite-batch effects (like starvation) are discussed for replay

Reproducibility

Theoretical paper. Mathematical proofs and derivations constitute the reproducibility artifacts (contained in the paper and appendices).

📊 Experiments & Results

Evaluation Setup

Theoretical analysis of gradient flow dynamics on convex mixture landscapes

Metrics:

Mixture weight convergence ($\beta^*$)
Parameter drift magnitude ($||m_o - \mu_o||$)
Statistical methodology: Mathematical proof

Main Takeaways

Forward-KL (SFT) on new data is strictly increasing in the old-mixture weight $\beta$, causing the unique population minimizer to be $\beta^*=0$ (total mass forgetting).
Reverse-KL (RL) admits an exact decomposition where old-mode updates are driven only by 'misassignment probabilities' (overlap), decaying exponentially with mode separation $\delta$.
Replay fundamentally differs by objective: for SFT, it modifies the population optimum (forcing $\beta^* > 0$); for RL, it fixes stochastic variance (preventing 'old-mode starvation') without changing the population target.
Algorithmic analysis: SDFT acts as reverse-KL to an evolving teacher; TTT-Discover is mode-seeking and risks collapse without anchors; OAPL uses exponential tilting to preserve modes present in the reference.

📚 Prerequisite Knowledge

Prerequisites

Kullback-Leibler (KL) Divergence (Forward vs. Reverse)
Gaussian Mixture Models
Supervised Fine-Tuning (SFT) vs. Reinforcement Learning (RL) objectives

Key Terms

Forward-KL: The divergence $KL(P_{data} || Q_{model})$, typically minimized in Maximum Likelihood Estimation and Supervised Fine-Tuning (SFT)

Reverse-KL: The divergence $KL(Q_{model} || P_{target})$, typically minimized in Reinforcement Learning (RL) and on-policy distribution matching

Mass Forgetting: A form of catastrophic forgetting where the model assigns zero probability weight ($eta=0$) to the old task/mode

Old-Component Drift: A form of forgetting where the model retains weight on the old mode, but the parameters (e.g., mean) shift away from the correct old distribution

Bhattacharyya coefficient: A statistical measure of the amount of overlap between two statistical samples or populations

SFT: Supervised Fine-Tuning—training a model to maximize the likelihood of a fixed dataset

SDFT: Self-Distillation Fine-Tuning—a method analyzed in the paper

OAPL: On-Policy Alignment from Partial Lagged references—a method analyzed in the paper

TTT-Discover: Test-Time Training Discover—a method analyzed in the paper