Forward-KL: The divergence $\mathrm{KL}(P_{\mathrm{data}} \,\|\, Q_{\mathrm{model}})$, typically minimized in Maximum Likelihood Estimation and Supervised Fine-Tuning (SFT)
Reverse-KL: The divergence $\mathrm{KL}(Q_{\mathrm{model}} \,\|\, P_{\mathrm{target}})$, typically minimized in Reinforcement Learning (RL) and on-policy distribution matching
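The asymmetry between the two divergences is easy to see numerically. The toy discrete distributions below are illustrative only (not from the paper); the same `kl` function computes both directions by swapping its arguments:

```python
import numpy as np

# Illustrative discrete distributions over a 3-element support.
p = np.array([0.6, 0.3, 0.1])  # stands in for P_data / P_target
q = np.array([0.1, 0.6, 0.3])  # stands in for Q_model

def kl(a, b):
    """KL(a || b) for discrete distributions with full support."""
    return float(np.sum(a * np.log(a / b)))

forward = kl(p, q)  # forward-KL: penalizes Q for missing mass where P is high
reverse = kl(q, p)  # reverse-KL: penalizes Q for placing mass where P is low
```

Because the two directions weight errors differently (forward-KL is mode-covering, reverse-KL is mode-seeking), `forward` and `reverse` generally differ even for the same pair of distributions.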
Mass Forgetting: A form of catastrophic forgetting where the model assigns zero mixture weight ($\eta = 0$) to the old task/mode
Old-Component Drift: A form of forgetting where the model retains nonzero mixture weight on the old mode, but that component's parameters (e.g., its mean) shift away from the correct old distribution
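The distinction between the two failure modes can be sketched with a two-component Gaussian mixture. The specific means and weights below are hypothetical, chosen only to make the contrast visible:

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of a univariate Gaussian N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, eta, old_mu, new_mu=5.0):
    """Two-component mixture: weight eta on the old mode, 1 - eta on the new."""
    return eta * normal_pdf(x, old_mu) + (1 - eta) * normal_pdf(x, new_mu)

x_old = 0.0  # center of the correct old-task distribution

# Healthy model: old mode kept at the right place with substantial weight.
healthy = mixture_pdf(x_old, eta=0.5, old_mu=0.0)

# Mass forgetting: the old component's weight collapses to zero (eta = 0),
# so essentially no density remains on the old task.
mass_forgot = mixture_pdf(x_old, eta=0.0, old_mu=0.0)

# Old-component drift: weight is retained (eta = 0.5), but the old
# component's mean has drifted from 0.0 to 3.0.
drifted = mixture_pdf(x_old, eta=0.5, old_mu=3.0)
```

Evaluating the mixture density at the old task's center shows both failures losing density there, but through different mechanisms: a vanished weight versus a mislocated component.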
Bhattacharyya coefficient: A statistical measure of the overlap between two probability distributions, ranging from 0 (disjoint supports) to 1 (identical distributions)
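For discrete distributions the coefficient is $BC(P, Q) = \sum_x \sqrt{P(x)\,Q(x)}$. A minimal sketch, reusing illustrative toy distributions:

```python
import numpy as np

# Illustrative discrete distributions over a shared support.
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.1, 0.6, 0.3])

def bhattacharyya(a, b):
    """BC(a, b) = sum_i sqrt(a_i * b_i): 1 for identical distributions,
    0 when the supports are disjoint."""
    return float(np.sum(np.sqrt(a * b)))

bc = bhattacharyya(p, q)
identical = bhattacharyya(p, p)  # equals 1.0 since p is normalized
```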
SFT: Supervised Fine-Tuning—training a model to maximize the likelihood of a fixed dataset
SDFT: Self-Distillation Fine-Tuning—a method analyzed in the paper
OAPL: On-Policy Alignment from Partial Lagged references—a method analyzed in the paper
TTT-Discover: Test-Time Training Discover—a method analyzed in the paper