DPO: Direct Preference Optimization—an offline method optimizing a policy directly from preference data without a separate reward model
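The DPO loss can be sketched in a few lines. This is an illustrative implementation, assuming per-response summed log-probabilities are precomputed; the function name and argument names are my own, not from the source:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Illustrative DPO loss over a batch of preference pairs.

    Each argument is a list of summed token log-probabilities for a full
    response under the learned policy (logp_*) or the frozen reference
    policy (ref_logp_*). beta scales the implicit reward.
    """
    # Implicit reward margin: beta * (log-ratio of chosen - log-ratio of rejected)
    margins = [
        beta * ((lc - rc) - (ll - rl))
        for lc, ll, rc, rl in zip(
            logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected
        )
    ]
    # Bradley-Terry negative log-likelihood: -log sigmoid(m) = log(1 + exp(-m))
    return sum(math.log1p(math.exp(-m)) for m in margins) / len(margins)
```

Note that nothing here samples from the policy: the loss is computed entirely on a fixed offline dataset, which is exactly what makes DPO offline.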
RLHF: Reinforcement Learning from Human Feedback—a two-stage process of learning a reward model and then optimizing a policy via online RL (like PPO)
Global Coverage: A strong condition requiring the offline dataset's distribution to cover the support of every policy in the class, i.e., any response a candidate policy could generate must have positive probability under the data distribution
Local Coverage: A weaker condition where the offline dataset only needs to cover the policies within a certain KL-divergence ball of the reference policy
Reverse KL: KL(π || π_ref) — measures divergence where the expectation is taken over the learned policy π; requires online sampling to estimate
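The "requires online sampling" point can be made concrete with a Monte Carlo estimator: since the expectation is over the learned policy, estimating KL(π || π_ref) needs fresh samples from π, not from a fixed dataset. A minimal sketch with hypothetical helper names:

```python
import math
import random

def reverse_kl_estimate(sample_from_pi, logp_pi, logp_ref, n=10000, seed=0):
    """Monte Carlo estimate of KL(pi || pi_ref) = E_{y ~ pi}[log pi(y) - log pi_ref(y)].

    sample_from_pi draws a response from the *learned* policy, which is why
    this quantity cannot be estimated from offline data alone.
    """
    rng = random.Random(seed)
    ys = [sample_from_pi(rng) for _ in range(n)]
    return sum(logp_pi(y) - logp_ref(y) for y in ys) / n

# Usage: two categorical distributions over {0, 1} standing in for policies.
pi = {0: 0.8, 1: 0.2}
ref = {0: 0.5, 1: 0.5}
est = reverse_kl_estimate(
    lambda rng: 0 if rng.random() < pi[0] else 1,
    lambda y: math.log(pi[y]),
    lambda y: math.log(ref[y]),
)
```

For these two distributions the exact value is 0.8 ln(1.6) + 0.2 ln(0.4) ≈ 0.193, and the estimate converges to it as n grows.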
HyPO: Hybrid Preference Optimization—the proposed algorithm mixing offline contrastive loss with an online KL regularization term
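Schematically, HyPO's objective adds an online reverse-KL regularizer to the offline contrastive (DPO-style) loss. The sketch below is my own simplification, not the paper's code: `margins` are beta-scaled offline preference margins, `kl_samples` are per-sample log π(y) − log π_ref(y) values for fresh online draws y ~ π, and `lam` is a hypothetical regularization weight:

```python
import math

def hypo_objective(margins, kl_samples, lam=0.05):
    """Schematic hybrid objective: offline DPO term + lam * online reverse-KL term.

    margins: beta * (log-ratio of chosen - log-ratio of rejected) per offline pair.
    kl_samples: log pi(y) - log pi_ref(y) for online samples y ~ pi, so their
                mean is a Monte Carlo estimate of KL(pi || pi_ref).
    """
    # Offline contrastive term: Bradley-Terry NLL, as in DPO.
    dpo_term = sum(math.log1p(math.exp(-m)) for m in margins) / len(margins)
    # Online regularization term: empirical reverse KL to the reference policy.
    kl_term = sum(kl_samples) / len(kl_samples)
    return dpo_term + lam * kl_term
```

The KL term is what distinguishes HyPO from pure DPO: it penalizes drift from the reference policy where the learner actually puts mass, which is why only local (not global) coverage of the data is needed.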
PPO: Proximal Policy Optimization—a standard online RL algorithm used in RLHF
Function Approximation: Using a parameterized model (like a neural network) to estimate values for unseen states, essential for generalization in partial coverage settings