Bradley-Terry (BT) model: A statistical model that estimates the probability of one item being preferred over another as a logistic (sigmoid) function of the difference in their underlying reward scores
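As a minimal sketch of this definition (the function name is illustrative, not from the paper), the BT preference probability is just a sigmoid applied to the score gap:

```python
import math

def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B:
    sigmoid of the difference in their reward scores."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
```

Equal rewards yield a 50/50 preference, and the two orderings' probabilities always sum to one.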
RLHF: Reinforcement Learning from Human Feedback—a technique to align LLMs with human intent using a reward model to guide generation
DPO: Direct Preference Optimization—a method to optimize policies directly from preferences without an explicit reward model
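For intuition, a per-pair sketch of the DPO objective (variable names are illustrative): the loss is the negative log-sigmoid of the scaled difference between the policy-vs-reference log-ratios of the chosen and rejected responses.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair: -log sigmoid of the
    beta-scaled difference of policy-vs-reference log-ratios."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; raising the chosen response's log-ratio relative to the rejected one drives the loss down.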
ArmoRM: A specific open-source reward model used in this paper to score and filter synthetic data pairs
Magpie: A synthetic data generation method that prompts an aligned LLM with only the user-turn prefix of its chat template, exploiting the model's autoregressive nature to make it generate both user queries and responses
Adversarial examples: Inputs designed to trick a model; in this context, harmful prompts paired with compliance (bad) vs. refusal (good) responses
Focal loss: A loss function that down-weights easy examples to focus training on hard-to-classify examples
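A minimal single-example sketch of focal loss (the standard formulation, not code from the paper): cross-entropy is down-weighted by a factor that shrinks as the model's confidence in the correct class grows.

```python
import math

def focal_loss(p_correct: float, gamma: float = 2.0) -> float:
    """Focal loss for one example: cross-entropy -log(p) scaled by
    (1 - p)^gamma, so easy (high-confidence) examples contribute
    little and hard examples dominate the gradient."""
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)
```

With gamma = 0 it reduces to plain cross-entropy; larger gamma suppresses easy examples more aggressively.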
Hinge loss: A loss function that penalizes the model only if the margin between correct and incorrect class scores is less than a threshold
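A sketch of the pairwise form of hinge loss described above (names illustrative): the loss is zero once the correct score beats the incorrect one by the margin threshold, and grows linearly otherwise.

```python
def hinge_loss(score_correct: float, score_other: float,
               margin: float = 1.0) -> float:
    """Pairwise hinge loss: zero when score_correct exceeds
    score_other by at least `margin`; linear penalty otherwise."""
    return max(0.0, margin - (score_correct - score_other))
```

Because examples already beyond the margin incur zero loss, training effort concentrates on pairs the model has not yet separated.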
HelpSteer2: A compact, high-quality preference dataset with multi-attribute annotations like helpfulness and correctness