DPO: Direct Preference Optimization—a method to align language models to preferences without training an explicit reward model, using the policy itself to define the implicit reward
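The implicit reward in DPO is β·log(π_θ(y|x)/π_ref(y|x)), and the loss is the negative log-likelihood that the chosen response beats the rejected one under a Bradley-Terry model on those implicit rewards. A minimal numeric sketch, assuming per-response log-probabilities are already summed (the function name and β value are illustrative, not from the source):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x))
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Negative log-likelihood of the chosen response winning
    # under Bradley-Terry on the implicit rewards
    return -math.log(sigmoid(r_chosen - r_rejected))
```

When the policy assigns no margin over the reference (all log-probs equal), the loss is log 2; it shrinks as the policy's margin for the chosen response grows relative to the rejected one.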
Reward Model (RM): A model trained to predict human preferences between text outputs, usually outputting a scalar score
RLHF: Reinforcement Learning from Human Feedback—a technique to fine-tune language models using reward signals derived from human preferences
Bradley-Terry model: A statistical model used to predict the outcome of a comparison between two items, typically used to convert pairwise preferences into scalar rewards
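Under Bradley-Terry, the probability that item A is preferred over item B is the sigmoid of the difference of their scalar rewards, which is what lets pairwise preference labels train a scalar reward model. A minimal sketch (function name is illustrative):

```python
import math

def bt_prob(r_a, r_b):
    # P(A preferred over B) = sigmoid(r_a - r_b)
    #                       = exp(r_a) / (exp(r_a) + exp(r_b))
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))
```

Equal rewards give a 50/50 preference, and the two orderings' probabilities always sum to one.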
Policy: The language model being trained to generate text (as opposed to the reward model which judges text)
Chat Hard: A subset of RewardBench focusing on trick questions and subtle instruction following, where rejected answers look plausible but are wrong
XSTest: A dataset used to test for exaggerated safety refusals (e.g., refusing to answer safe questions that look unsafe)
Prior Sets: A collection of existing test sets (Anthropic HH, SHP, Summarize) used as a baseline category in RewardBench