DPO: Direct Preference Optimization—an algorithm that optimizes a language model to prefer 'chosen' over 'rejected' responses without fitting an explicit reward model
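The DPO objective can be sketched per example from sequence log-probabilities (a minimal sketch of the standard sigmoid-margin form with inverse temperature beta; `dpo_loss` is an illustrative name, not an API from the paper):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    The policy is pushed to widen the (chosen - rejected) log-prob
    margin relative to a frozen reference model; no reward model
    is fit explicitly.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin), written to avoid overflow for large negative margins
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin
```

A positive margin (the policy prefers the chosen response more strongly than the reference does) drives the loss toward zero.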
SFT: Supervised Finetuning—training the model to maximize the likelihood of ground-truth responses
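Concretely, maximizing the likelihood of a ground-truth response amounts to minimizing token-level cross-entropy (a minimal sketch; `softmax_nll` is a hypothetical helper, not from the paper):

```python
import math

def softmax_nll(logits, target):
    """Cross-entropy of one ground-truth token given raw logits.

    SFT minimizes the mean of this quantity over the response tokens,
    which is equivalent to maximizing their likelihood.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]  # -log softmax(logits)[target]
```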
eNTK: Empirical Neural Tangent Kernel—a matrix whose entries measure, via gradient similarity, how much updating the model on one example changes the prediction on another example
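Given per-example gradient vectors, the eNTK is simply their Gram matrix (a toy sketch with hypothetical hand-written gradients, not gradients of any real model):

```python
import numpy as np

# Hypothetical per-example gradients of the model output w.r.t. the
# parameters: one row per training example (3 examples, 4 parameters).
grads = np.array([[1.0, 0.0, 2.0, 0.0],
                  [1.0, 0.0, 2.0, 0.0],    # identical to example 0
                  [0.0, 1.0, 0.0, -1.0]])  # orthogonal to example 0

# eNTK entry K[i, j] = grad_i . grad_j: how strongly a gradient step
# on example j moves the prediction on example i.
K = grads @ grads.T
```

Identical gradients give a large off-diagonal entry (updating one example moves the other just as much); orthogonal gradients give zero (no interaction).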
squeezing effect: A phenomenon where negative gradients on unlikely classes in softmax models force probability mass into the single highest-probability class, often leading to repetitive or degenerate output
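The squeezing effect can be reproduced in a three-class toy model: one gradient-ascent step on the negative log-probability of an already-unlikely class (i.e., a negative gradient on it) drains mass from every non-dominant class into the top class (a sketch; the logits and learning rate are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy logits: class 0 already dominates, class 2 is unlikely.
z = np.array([4.0, 2.0, 0.0])
p = softmax(z)

# Gradient of -log p_2 w.r.t. the logits is (p - onehot(2)); a negative
# gradient on class 2 means one *ascent* step on that loss.
onehot = np.array([0.0, 0.0, 1.0])
lr = 1.0
z_new = z + lr * (p - onehot)  # ascent: push p_2 down
p_new = softmax(z_new)
# Mass leaves BOTH non-dominant classes (1 and 2) and is squeezed into
# class 0, the single highest-probability class.
```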
learning dynamics: The study of how model predictions change step-by-step during training as a function of optimization updates
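A one-step view ties several of these terms together. Under gradient descent with learning rate $\eta$, the change in the model's prediction on an observed example $x_o$ after an update on example $x_u$ can be sketched as a product of three factors (an assumed decomposition, stated here only to connect the eNTK and residual-term entries in this glossary; $\mathcal{A}_t$ denotes the local map from logits to the monitored prediction):

```latex
\Delta f_t(x_o) \;\approx\; -\eta \,\mathcal{A}_t(x_o)\,\mathcal{K}_t(x_o, x_u)\,\mathcal{G}_t(x_u)
```

where $\mathcal{K}_t$ is the eNTK and $\mathcal{G}_t$ is the residual term.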
off-policy: Training on data generated by a different policy (e.g., a static dataset) rather than the model currently being trained
on-policy: Training on data generated by the current version of the model itself
hallucination: Model generation of incorrect or non-factual content, analyzed here as the model answering question A with facts that belong to question B ('facts from question B answering question A')
teacher forcing: A training technique where the model is fed the ground-truth previous token as input for the next step, rather than its own generation
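The training loop shape is easy to show (a minimal sketch; `model` is a hypothetical stand-in mapping a token prefix to a next-token prediction):

```python
def teacher_forced_predictions(model, target_tokens):
    """One pass with teacher forcing: at every step the model sees the
    ground-truth prefix, never its own earlier predictions."""
    preds = []
    for t in range(1, len(target_tokens)):
        prefix = target_tokens[:t]   # ground-truth tokens 0..t-1
        preds.append(model(prefix))  # predict token t from them
    return preds

# Dummy stand-in "model" (echoes last token + 1) just to show the loop.
demo = teacher_forced_predictions(lambda prefix: prefix[-1] + 1, [1, 2, 3, 4])
# demo == [2, 3, 4]
```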
residual term: The vector difference between the current prediction and the target, determining the direction of the gradient update (denoted as G_t in the paper)
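For token-level cross-entropy this residual has a closed form: the softmax prediction minus the one-hot target, which is exactly the gradient of the loss with respect to the logits (a sketch; the logits are illustrative and `G` stands in for the paper's G_t):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative logits over 3 tokens; the ground-truth token is index 1.
z = np.array([2.0, 1.0, 0.0])
target = np.array([0.0, 1.0, 0.0])

# Residual: current prediction minus target. Its entries sum to zero,
# and its sign pattern sets the direction of the update (push the
# target token up, all others down).
G = softmax(z) - target
```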
logits: The raw, unnormalized scores output by the neural network before the softmax layer