RLHF: Reinforcement Learning from Human Feedback—a method to align models using a reward model trained on human preferences
PFT: Preference Fine-Tuning—fine-tuning models to generate outputs preferred by humans or other scorers
DPO: Direct Preference Optimization—an offline method that optimizes the policy to satisfy preferences directly, without training an explicit reward model or running an RL loop
PPO: Proximal Policy Optimization—an online RL algorithm often used in the second stage of RLHF
MLE: Maximum Likelihood Estimation—standard supervised learning objective maximizing the probability of data
Generation-Verification Gap: The concept that it is often computationally easier to verify a good solution (the reward model's job) than to generate one (the policy's job)
Proper Learning: Learning a hypothesis from a restricted class (e.g., policies optimal for some reward model) rather than any arbitrary function
Isomorphic Classes: When the set of functions representable by the policy class is mathematically equivalent to the set representable by the reward model class
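To make the DPO entry concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. The function name, argument names, and the default `beta` are illustrative assumptions, not from the source; the inputs are log-probabilities of the preferred (`w`) and dispreferred (`l`) responses under the policy and a frozen reference model.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
    the difference of policy-vs-reference log-ratios between the preferred
    (w) and dispreferred (l) responses. Names/values are illustrative."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)); fine for a sketch, use a
    # numerically stable form for very large |margin| in practice.
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference on both responses the margin is zero and the loss is log 2; the loss falls as the policy raises the preferred response's log-ratio relative to the dispreferred one.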
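Likewise, the PPO entry can be illustrated with the clipped surrogate objective for a single state-action pair. This is a generic sketch of PPO's core term, not code from the source; `ratio` is the policy probability ratio pi_new(a|s)/pi_old(a|s), and `eps` is the assumed clip range.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    Clipping removes the incentive to move the ratio far from 1."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage the objective stops increasing once the ratio exceeds 1 + eps; with a negative advantage it stops decreasing the penalty once the ratio drops below 1 - eps, keeping updates proximal to the old policy.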