SFT: Supervised Fine-Tuning—training a model to predict the next token in a provided dataset
RL: Reinforcement Learning—training a model to maximize a reward signal, often using on-policy data (data generated by the model itself)
DPO: Direct Preference Optimization—an offline RL method that optimizes a policy to assign higher likelihood to chosen (winning) responses than to rejected (losing) ones
SimPO: Simple Preference Optimization—a reference-model-free variant of DPO that uses the length-normalized average log-probability of a response as the implicit reward
CLL: Centered Log-Likelihood—the proposed metric, computed as a token's log-probability plus the entropy of the model's predictive distribution, measuring how well that token fits the model's current distribution
SNR: Signal-to-Noise Ratio—a measure used in Signal Detection Theory to quantify the discriminability between two distributions
On-policy data: Data generated by the model's current policy (its own distribution), as opposed to fixed external datasets
Catastrophic forgetting: The tendency of neural networks to lose previously learned knowledge when trained on new data
Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer
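To make the CLL entry above concrete, here is a minimal NumPy sketch of computing the metric for a single next-token distribution. The function name and signature are illustrative, not from the source; it assumes raw (unnormalized) logits as input. Since entropy is the negative expected log-probability, CLL equals a token's log-probability minus the distribution's mean log-probability, so a positive CLL means the token is more probable than an average sample from the model's own distribution.

```python
import numpy as np

def centered_log_likelihood(logits, token_id):
    """CLL of one token: log p(token) + H(p).

    Equivalent to log p(token) minus the expectation of log p under
    the model's own distribution ("centering" the log-likelihood).
    """
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    probs = np.exp(log_probs)
    entropy = -(probs * log_probs).sum()                 # H(p)
    return log_probs[token_id] + entropy

# Sanity check: under a uniform distribution over V tokens, every
# token's CLL is log(1/V) + log(V) = 0 (up to floating-point error).
print(centered_log_likelihood(np.zeros(5), 2))  # ≈ 0
```

A useful consequence of the centering: tokens the model itself would likely generate (on-policy-like tokens) get positive CLL, while off-distribution tokens get negative CLL, regardless of the distribution's overall sharpness.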