RLHF: Reinforcement Learning from Human Feedback—a technique to fine-tune models using a reward signal derived from human preferences
SFT: Supervised Fine-Tuning—training a model on a dataset of high-quality human demonstrations (prompts and desired responses)
RM: Reward Model—a model trained to predict which of two outputs a human would prefer
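The reward model is typically trained with a pairwise (Bradley-Terry style) loss: the output the labeler preferred should receive a higher score than the rejected one. A minimal scalar sketch of that loss (the function name and toy scores are illustrative, not from any particular library):

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Trains the reward model to score the human-preferred ('chosen')
    output higher than the rejected one. Toy scalar version; real
    implementations operate on batches of model-scored pairs.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Loss is small when the chosen output already outscores the rejected
# one, and large when the ranking is inverted.
low = reward_model_loss(2.0, -1.0)
high = reward_model_loss(-1.0, 2.0)
```

Minimizing this loss over many labeled pairs pushes the scalar scores to reproduce the labelers' rankings.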
PPO: Proximal Policy Optimization—an RL algorithm that updates the policy to maximize reward while limiting how much the policy changes in one step
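The "limiting how much the policy changes" part of PPO comes from its clipped surrogate objective: the probability ratio between the new and old policies is clipped to a small interval. A per-sample sketch, assuming log-probabilities and an advantage estimate are already available (the function and argument names are illustrative):

```python
import math

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, eps: float = 0.2) -> float:
    """PPO's clipped surrogate objective for a single action.

    The ratio pi_new / pi_old is clipped to [1 - eps, 1 + eps], so a
    single update step cannot move the policy arbitrarily far.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min makes the objective pessimistic: large ratio
    # changes stop contributing extra reward.
    return min(ratio * advantage, clipped * advantage)

# A big jump in probability (ratio ~ e) is capped at 1 + eps.
capped = ppo_clipped_objective(logp_new=-1.0, logp_old=-2.0, advantage=1.0)
# No policy change means the objective is just the advantage.
unchanged = ppo_clipped_objective(logp_new=-2.0, logp_old=-2.0, advantage=1.0)
```

In practice this is averaged over batches of sampled tokens, but the clipping logic per action is exactly this.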
PPO-ptx: A variant of PPO that mixes a pretraining loss term into the objective, reducing the performance regressions on public NLP tasks known as the alignment tax
Alignment tax: The cost in performance on specific public NLP tasks (like SQuAD or translation) that comes from aligning the model to human preferences
Hallucination: When a model generates information that is factually incorrect or not present in the source input
Prompt: The input text given to a language model to elicit a response
Labeler: A human contractor who writes demonstrations or ranks model outputs
Win rate: The percentage of time one model's output is preferred over another model's output by human judges
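Computing a win rate from head-to-head judgments is simple; a sketch, assuming each judgment records which model's output the judge picked:

```python
def win_rate(judgments: list[str]) -> float:
    """Fraction of pairwise comparisons won by model A.

    `judgments` is a list of picks, each "A" or "B", one per
    head-to-head comparison judged by a human.
    """
    return sum(1 for j in judgments if j == "A") / len(judgments)

rate = win_rate(["A", "A", "B", "A"])
```

Reported win rates are usually aggregated over many prompts and several judges per prompt.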
KL penalty: Kullback-Leibler divergence penalty—a per-token penalty subtracted from the reward to keep the RL policy from drifting too far from the initial supervised model
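The KL penalty enters the objective as reward shaping: the reward-model score minus a scaled per-token log-probability gap between the policy and the reference (supervised) model. A scalar sketch, with an illustrative coefficient name `beta`:

```python
def shaped_reward(rm_score: float, logp_policy: float,
                  logp_ref: float, beta: float = 0.02) -> float:
    """Per-token reward with a KL penalty term.

    Subtracts beta * (log pi(a|s) - log pi_ref(a|s)) from the reward
    model's score, so the RL policy is penalized for assigning tokens
    probabilities that differ from the initial supervised model's.
    """
    return rm_score - beta * (logp_policy - logp_ref)

# When the policy matches the reference, the penalty vanishes.
no_drift = shaped_reward(1.0, logp_policy=-2.0, logp_ref=-2.0)
# When the policy is more confident than the reference, reward drops.
drifted = shaped_reward(1.0, logp_policy=-1.0, logp_ref=-2.0, beta=0.1)
```

Summed over tokens, the penalty term is an estimate of the KL divergence between the policy and the reference model on the sampled response.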