SFT: Supervised Fine-Tuning—training a model to maximize the likelihood of ground-truth outputs
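The SFT objective can be sketched as a per-token negative log-likelihood; this is a minimal illustration, not the paper's training code (the function name and averaging choice are assumptions):

```python
import math

def sft_loss(token_probs):
    """Illustrative SFT objective: average negative log-likelihood of the
    ground-truth tokens under the model. Lower loss = higher likelihood
    assigned to the reference output."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)
```

A model that assigns higher probability to every ground-truth token always has lower SFT loss, e.g. `sft_loss([0.9, 0.9]) < sft_loss([0.5, 0.5])`.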
DPO: Direct Preference Optimization—an alignment method that optimizes a policy to prefer chosen responses over rejected ones without a separate reward model
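The standard DPO loss for a single preference pair can be written as a short function; this is a sketch of the published formulation, not the authors' implementation, and the `beta=0.1` default is an assumption:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.
    The implicit reward is the log-prob ratio against a frozen reference
    policy; no separate reward model is trained."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when chosen is favored over rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Raising the chosen response's log-probability relative to the reference (or lowering the rejected one's) shrinks the loss.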
RL: Reinforcement Learning—training an agent (model) to maximize a reward signal
Oracle: A stronger teacher model (e.g., Gemini 2.5 Pro) or a human, used to correct the student model's errors
LCS: Longest Common Subsequence—a metric used to measure the similarity between two text sequences
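LCS length is computed with the classic dynamic-programming recurrence; a similarity score can then be taken as the LCS length normalized by sequence length (the normalization here is an assumption, not necessarily the paper's exact metric):

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming.
    dp[i][j] = LCS length of a[:i] and b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_similarity(a, b):
    """LCS length normalized by the longer sequence (one common convention)."""
    return lcs_length(a, b) / max(len(a), len(b)) if a or b else 1.0
```

It works on any sequence type, so text can be compared at the character or token level, e.g. `lcs_length("ABCBDAB", "BDCABA")` is 4.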
Elastic Tether: The authors' term for the dynamic gradient scaling in reward-based objectives that vanishes as the model becomes confident, preventing over-optimization and forgetting
Pull-up effect: A phenomenon where increasing the probability of a correct response inadvertently increases the probability of similar but incorrect responses
BCE: Binary Cross Entropy—a loss function used for binary classification tasks
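For a single example with label `y ∈ {0, 1}` and predicted probability `p`, BCE is `-[y log p + (1-y) log(1-p)]`; a minimal sketch with clipping for numerical safety:

```python
import math

def bce(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for one example.
    Clips p_pred away from 0 and 1 to avoid log(0)."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))
```

Confident correct predictions give near-zero loss, while a prediction of 0.5 costs exactly `log 2` regardless of the label.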
OOD: Out-of-Domain—tasks or data distributions not seen during training
KL divergence: A measure of how one probability distribution differs from another, used to constrain the model from drifting too far from its initial state
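For discrete distributions, KL(P‖Q) = Σᵢ pᵢ log(pᵢ/qᵢ); a minimal sketch (terms with pᵢ = 0 contribute zero by convention):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as lists of
    probabilities over the same support. Nonnegative, zero iff P == Q,
    and asymmetric in its arguments."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In alignment training it is typically applied token-wise between the current policy's and the reference policy's next-token distributions, penalizing drift from the initial model.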