RLHF: Reinforcement Learning from Human Feedback—aligning models using a reward model trained on human preferences
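The RLHF optimization step is commonly written as maximizing expected reward under a KL penalty, with reward model $r_\phi$, policy $\pi_\theta$, reference model $\pi_{\mathrm{ref}}$, and penalty weight $\beta$:

\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[r_\phi(x, y)\right] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)
\]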
RLAIF: Reinforcement Learning from AI Feedback—using an AI system instead of humans to generate preference labels for alignment
PPO: Proximal Policy Optimization—an on-policy RL algorithm used to optimize the LLM against the reward model
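For reference, PPO's clipped surrogate objective in the standard notation of Schulman et al. (2017), with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$, advantage estimate $\hat{A}_t$, and clip range $\epsilon$:

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
\]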
DPO: Direct Preference Optimization—an algorithm that optimizes the policy directly from preference data without training an explicit reward model
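The DPO loss is the Bradley-Terry negative log-likelihood applied to implicit rewards. A minimal PyTorch-style sketch (function and argument names are illustrative; it assumes per-response summed log-probabilities have already been computed for the chosen and rejected responses):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry negative log-likelihood on the reward margin
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```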
SFT: Supervised Fine-Tuning—the first stage of the alignment pipeline, in which the model learns to follow instructions from labeled demonstrations
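In its usual form, the SFT objective is ordinary next-token cross-entropy over instruction–response pairs $(x, y)$:

\[
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\sum_{t} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)\right]
\]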
Bradley-Terry model: A statistical model for estimating the probability that one item is preferred over another based on their latent scores
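With latent scores $s_i$ and $s_j$, the Bradley-Terry preference probability is

\[
P(i \succ j) = \frac{e^{s_i}}{e^{s_i} + e^{s_j}} = \sigma(s_i - s_j)
\]

where $\sigma$ is the logistic sigmoid; this is the form used to fit reward models on pairwise comparisons.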
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a reference distribution, used in RLHF as a penalty that keeps the aligned model from drifting too far from the reference model (typically the SFT model)
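For discrete distributions $P$ and $Q$:

\[
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
\]

In RLHF this appears as the penalty term $\beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ in the objective above.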
Implicit Reward Model: A reward function that is mathematically derived from the optimal policy itself (as in DPO), bypassing the need for a separate reward network
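Under the DPO derivation, the reward implied by a policy is

\[
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\]

where the partition function $Z(x)$ depends only on the prompt and cancels in pairwise comparisons, which is why no separate reward network is needed.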
Pointwise Reward: A single scalar score assigned to a specific prompt-response pair
Listwise Feedback: Feedback where a labeler ranks a list of K responses rather than just comparing a pair
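Listwise preferences are commonly modeled with the Plackett-Luce extension of Bradley-Terry; for a ranking $y_{\tau(1)} \succ \cdots \succ y_{\tau(K)}$ over responses with scores $s_k$:

\[
P\!\left(y_{\tau(1)} \succ \cdots \succ y_{\tau(K)}\right) = \prod_{k=1}^{K} \frac{e^{s_{\tau(k)}}}{\sum_{j=k}^{K} e^{s_{\tau(j)}}}
\]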
Off-policy RL: Learning from data generated by a previous version of the policy (or a different policy entirely), rather than by the current policy being trained
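The standard correction for off-policy data is importance sampling: an expectation under the current policy $\pi_\theta$ is rewritten as an expectation under the behavior policy $\mu$ that generated the data,

\[
\mathbb{E}_{y \sim \pi_\theta}\!\left[f(y)\right] = \mathbb{E}_{y \sim \mu}\!\left[\frac{\pi_\theta(y)}{\mu(y)}\, f(y)\right]
\]

which is the same probability ratio that PPO's clipped objective above keeps close to 1.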