RLHF: Reinforcement Learning from Human Feedback—a framework for aligning models using human preference data
DPO: Direct Preference Optimization—an algorithm that optimizes a policy directly from preferences without an explicit reward model
XPO: Exploratory Preference Optimization—the proposed algorithm that adds an exploration bonus to DPO
KL regularization: A penalty term that keeps the learned policy close to a reference policy, preventing degenerate behavior such as mode collapse or unsafe drift
Bellman error: The difference between the current value estimate and the value estimate after taking a step and observing the reward; minimizing this is central to many RL algorithms
Global Optimism: An exploration strategy where the agent acts according to a hypothesis that is optimistic about the potential rewards in unexplored regions
Contextual Bandit: A simplified RL setting with a single step (state -> action -> reward), often used to model RLHF prompts and responses
Online Exploration: Actively collecting new data during training by interacting with the environment (or human/AI labeler) rather than relying solely on a static dataset
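A few of the terms above can be made concrete in code. The sketch below is illustrative only—the function names and default coefficients are assumptions, not the paper's implementation. It shows (i) the per-example DPO loss, in which the KL-regularizing reference policy appears through the log-ratios, and (ii) the one-step Bellman (temporal-difference) error.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss. Each response's implicit reward is beta times
    the log-ratio between the policy and the KL-anchored reference policy."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: the loss shrinks as the policy
    # prefers the chosen response more strongly than the reference does.
    return -math.log(sigmoid(margin))

def td_error(value, next_value, reward, gamma=0.99):
    """One-step Bellman (temporal-difference) error: the gap between the
    current value estimate and the bootstrapped one-step target."""
    return reward + gamma * next_value - value
```

In a contextual bandit (single-step) setting like RLHF, the Bellman target degenerates to the immediate reward, which is why preference-based objectives such as the DPO loss above suffice without value bootstrapping.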