SLiC: Sequence Likelihood Calibration—a method that calibrates the likelihoods a model assigns to candidate sequences so they agree with a quality ranking of those candidates
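One calibration loss used in SLiC-style training is a rank hinge: the preferred candidate's sequence log-likelihood should beat the dispreferred one's by at least a margin. A minimal sketch (the margin value and the toy log-likelihoods below are illustrative, not from the source):

```python
def rank_calibration_loss(logp_pos, logp_neg, margin=1.0):
    """Hinge-style rank calibration loss: zero once the preferred
    sequence's log-likelihood exceeds the dispreferred one's by
    at least `margin`, linear penalty otherwise."""
    return max(0.0, margin - (logp_pos - logp_neg))

# Toy sequence log-likelihoods (sums of per-token log-probs).
print(rank_calibration_loss(-12.0, -15.0))  # preferred ahead by 3 > margin -> 0.0
print(rank_calibration_loss(-14.0, -13.0))  # preferred behind by 1 -> 2.0
```

In practice the log-likelihoods come from the model being trained, and this term is combined with a regularizer that keeps the model close to its SFT starting point.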
RLHF: Reinforcement Learning from Human Feedback—fine-tuning models to maximize a reward signal derived from human preferences
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm used in RLHF to update the policy while preventing drastic deviations
SFT: Supervised Fine-Tuning—the initial training phase using ground truth labels, used as a starting point for alignment
Off-policy: Learning from data generated by a different policy (model) than the one currently being trained
Ranking Model: A model trained to output which of two candidate summaries is better (pairwise), rather than assigning a single score
Reward Model: A model trained to assign a scalar score to a single summary (pointwise)
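The two interfaces above differ in what the model outputs, not necessarily in how it is trained. A toy sketch of the interface distinction (the length-based scorer is a purely illustrative stand-in for a learned network):

```python
def reward_score(summary):
    """Pointwise interface: one summary in, one scalar out.
    Illustrative stand-in: prefers summaries near 20 words."""
    return -abs(len(summary.split()) - 20)

def ranking_prefers_first(summary_a, summary_b):
    """Pairwise interface: two candidates in, a preference out.
    A real ranking model sees both candidates jointly and never
    exposes an absolute score."""
    return reward_score(summary_a) >= reward_score(summary_b)
```

A pointwise reward model's scalar plugs directly into RL as a reward signal; a pairwise ranking model can only compare candidates, which suits sample-and-rank methods like SLiC.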
TL;DR: Too Long; Didn't Read—a dataset of Reddit posts and their user-written summaries
ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a standard metric for summarization based on n-gram overlap with a reference
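ROUGE-N recall can be computed directly as clipped n-gram overlap against the reference. A simplified single-reference sketch without stemming or stopword handling:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of the reference's n-grams that also
    appear in the candidate, with counts clipped so a repeated
    candidate n-gram cannot be credited more times than it occurs
    in the reference."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

# 5 of the reference's 6 unigrams appear in the candidate -> 5/6
print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))
```

Production evaluations typically report ROUGE-1, ROUGE-2, and ROUGE-L (longest common subsequence) F1 rather than raw recall.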