RLHF: Reinforcement Learning from Human Feedback—using human preferences to train a reward model that guides LLM generation
RLVR: Reinforcement Learning with Verifiable Rewards—using programmatic checkers (like compilers or math verifiers) to provide binary rewards
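As an illustration, a verifiable reward can be as simple as an exact-match check against a reference answer. This is a hypothetical sketch (the `verifiable_reward` function and its last-token extraction rule are illustrative, not from the paper):

```python
def verifiable_reward(response: str, expected_answer: str) -> int:
    """Binary reward: 1 if the response's final token exactly matches
    the reference answer, else 0 (a toy exact-match checker)."""
    tokens = response.strip().split()
    final = tokens[-1] if tokens else ""
    return 1 if final == expected_answer else 0

# verifiable_reward("The answer is 42", "42") yields 1
# verifiable_reward("The answer is 41", "42") yields 0
```

Real verifiers (compilers, unit tests, symbolic math checkers) are more involved, but they share this binary, programmatic structure.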
Bradley-Terry model: A statistical model used in RLHF to estimate the probability that one response is better than another based on pairwise comparisons
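Under the Bradley-Terry model, the probability that response A beats response B reduces to a sigmoid of the difference of their scalar reward scores. A minimal sketch:

```python
import math

def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(A preferred over B) = sigmoid(r_A - r_B) under Bradley-Terry,
    where r_A and r_B are reward-model scores."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Equal scores give a 50/50 preference probability.
# bradley_terry_prob(1.0, 1.0) yields 0.5
```

In RLHF, the reward model is trained by maximizing the log-likelihood of human pairwise choices under exactly this probability.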
Reward hacking: When an RL agent exploits flaws in the reward function to earn high scores without actually achieving the intended goal (e.g., writing very long but empty answers)
Entailment task: A classification task determining if a hypothesis (here, 'response satisfies principle') is true given a premise
HelpSteer3-Feedback: An open-source dataset containing prompts, responses, and textual human feedback used to extract principles in this paper
KTO: Kahneman-Tversky Optimization—a method using binary (good/bad) signals for alignment without pairwise preferences