RLHF: Reinforcement Learning from Human Feedback—a method to align AI models with human values using preference data.
DPO: Direct Preference Optimization—an algorithm that optimizes the policy directly from preference data without training a separate reward model.
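To make the DPO entry concrete, here is a minimal sketch of the per-pair DPO loss, −log σ(β[(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]); the function name and the example log-probabilities are illustrative, not from the source.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: y_w preferred over y_l.

    Inputs are log-probabilities of each response under the policy
    and under the frozen reference policy.
    """
    # Implicit reward margin: how much more the policy prefers y_w
    # over y_l, relative to the reference policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin), written in a numerically direct form.
    return math.log(1.0 + math.exp(-margin))

# Example: the policy favors the preferred response more than the
# reference does, so the margin is positive and the loss is small.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```

Note that no reward model appears anywhere: the preference data and the reference policy fully determine the loss.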
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm used to update the policy while preventing drastic changes.
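The "preventing drastic changes" in the PPO entry refers to its clipped surrogate objective; a minimal single-sample sketch (function name and epsilon value are illustrative):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped PPO surrogate for one action.

    ratio = pi_new(a|s) / pi_old(a|s); clipping it to [1-eps, 1+eps]
    removes the incentive to move the policy far from pi_old in a
    single update.
    """
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

Taking the min with the clipped term means the objective stops improving once the probability ratio leaves the trust region, in either direction.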
Reverse-KL: A specific direction of KL divergence (KL(π || π_ref)) used to regularize the policy π towards a reference π_ref. Because the expectation is taken under π, it is mode-seeking: it heavily penalizes the policy for placing mass where the reference does not, which is why it is the standard regularizer in RLHF (the forward direction, KL(π_ref || π), is instead mass-covering and encourages diversity).
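A sketch of reverse-KL over a discrete support, to make the direction explicit (the function name is illustrative):

```python
import math

def reverse_kl(pi, pi_ref):
    """KL(pi || pi_ref) for discrete distributions given as lists.

    The expectation is under pi, so terms where pi is 0 vanish, while
    any mass pi places where pi_ref is tiny incurs a large penalty --
    the mode-seeking behavior.
    """
    return sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)

# A peaked policy measured against a uniform reference over 3 outcomes.
kl = reverse_kl([0.9, 0.1, 0.0], [1/3, 1/3, 1/3])
```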
Contextual Bandit: A simplified reinforcement learning setting where the agent observes a state (context), takes an action, and receives a reward, but does not transition to a new state based on that action.
RSO: Rejection Sampling Optimization—a baseline method that samples multiple candidate outputs, scores them with a reward model, and trains on the highest-scoring ones.
Bradley-Terry model: A statistical model for predicting the outcome of a pairwise comparison (e.g., preference between two model outputs).
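Under the Bradley-Terry model, the probability that output A is preferred over output B depends only on the difference of their latent scores, via a sigmoid. A minimal sketch (function name and scores are illustrative):

```python
import math

def bt_prob(score_a, score_b):
    """Bradley-Terry probability that A is preferred over B.

    P(A > B) = sigmoid(s_A - s_B); equal scores give 0.5.
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))
```

This is the model typically used to turn pairwise preference labels into a training signal for a reward model (or, implicitly, for DPO).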