RLHF: Reinforcement Learning from Human Feedback—aligning AI models using rewards derived from human preferences
PPO: Proximal Policy Optimization—an RL algorithm using clipped updates to ensure training stability
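A minimal sketch of PPO's clipped surrogate objective (NumPy; the function name is illustrative, and `eps=0.2` is just a commonly used default). The probability ratio between the new and old policy is clipped so that a single update cannot move the policy far from the one that collected the data:

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective from PPO (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s); taking the minimum of the
    clipped and unclipped terms removes the incentive to push the
    ratio outside [1 - eps, 1 + eps], which is the source of PPO's
    training stability.
    """
    ratio = np.exp(logp_new - logp_old)          # importance ratio
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * advantages,
                                    clipped * advantages)))
```

With a positive advantage, the objective stops growing once the ratio exceeds `1 + eps`, capping how aggressively a good action's probability can be increased per step.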
KL divergence: A statistical measure of how one probability distribution differs from another; used in RLHF as a penalty that keeps the fine-tuned policy from drifting too far from the reference (e.g. SFT) model
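A minimal sketch for discrete distributions (NumPy; assumes both inputs are valid probability vectors with strictly positive entries). KL divergence is asymmetric, always non-negative, and zero only when the two distributions are identical:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete
    distributions. Note the asymmetry: KL(p || q) != KL(q || p)
    in general."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))
```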
SFT: Supervised Fine-Tuning—training models on high-quality examples before applying RL
Reward Hacking: When a model exploits loopholes in the reward function to get high scores without actually improving performance
DAP: Direct Alignment from Preferences—methods like DPO that optimize the model directly on preference pairs, without training a separate reward model or running an RL loop
GRPO: Group Relative Policy Optimization—a PPO-style RLHF algorithm that drops the learned value function and instead computes advantages by normalizing rewards within a group of responses sampled for the same prompt
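A minimal sketch of the group-relative advantage computation (NumPy; function name and the small epsilon guard are illustrative). Given the rewards of several responses sampled for one prompt, each response's advantage is its reward standardized against the group:

```python
import numpy as np

def group_relative_advantages(rewards):
    """Advantages from within-group normalization: subtract the
    group mean reward and divide by the group std. No learned
    value network is needed as a baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids /0
```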
RLOO: REINFORCE Leave-One-Out—an online alignment algorithm using leave-one-out baselines
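A minimal sketch of the leave-one-out baseline (NumPy; function name is illustrative, and it assumes at least two samples per prompt). Each sample's baseline is the mean reward of the *other* samples in its group, which reduces gradient variance while keeping the REINFORCE estimator unbiased:

```python
import numpy as np

def rloo_advantages(rewards):
    """REINFORCE Leave-One-Out advantages: reward minus the mean
    reward of the remaining k-1 samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    loo_baseline = (r.sum() - r) / (k - 1)  # excludes each sample itself
    return r - loo_baseline
```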
Weighted Regression: An optimization approach where the model is trained to maximize the likelihood of samples weighted by their quality (advantage)
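A minimal sketch of an advantage-weighted likelihood loss (NumPy; the exponential weighting and the `beta` temperature follow AWR-style methods, and the function name is illustrative). Samples with higher advantage receive larger weight, so minimizing the loss pushes probability mass toward better samples:

```python
import numpy as np

def weighted_regression_loss(logp, advantages, beta=1.0):
    """Weighted negative log-likelihood: each sample's log-prob is
    scaled by exp(advantage / beta), so high-advantage samples
    dominate the maximum-likelihood objective."""
    weights = np.exp(np.asarray(advantages, dtype=float) / beta)
    return float(-np.mean(weights * np.asarray(logp, dtype=float)))
```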
Trust Region: The region of policy space around the current or reference policy within which updates are considered safe and stable