RLHF: Reinforcement Learning from Human Feedback—a two-stage process of training a reward model on preferences, then optimizing a policy using RL (e.g., PPO)
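The first stage of RLHF fits the reward model with a Bradley–Terry preference loss. A minimal sketch of that per-pair loss (scalar toy version; real implementations batch this over a neural reward model):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    # Bradley-Terry model: P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    # The reward model is trained to minimize the negative log-likelihood
    # of the observed human preference.
    diff = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```

When the reward model assigns both responses the same score, the loss is log 2 (a coin flip); it shrinks as the chosen response's reward pulls ahead.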
DPO: Direct Preference Optimization—a method that optimizes the policy directly on preference data by reparameterizing the implicit reward as a function of the policy, bypassing explicit reward modeling
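DPO's implicit reward is r(x, y) = β · log(π(y|x) / π_ref(y|x)), and the loss is a logistic loss on the implicit-reward margin between chosen and rejected responses. A minimal scalar sketch (real implementations operate on batched sequence log-probabilities):

```python
import math

def dpo_loss(logp_pi_w, logp_pi_l, logp_ref_w, logp_ref_l, beta=0.1):
    # Implicit reward margin: beta * [log(pi/pi_ref) on chosen
    #                               - log(pi/pi_ref) on rejected]
    margin = beta * ((logp_pi_w - logp_ref_w) - (logp_pi_l - logp_ref_l))
    # Negative log-sigmoid of the margin, as in the DPO objective
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that no separate reward model appears: the policy's own log-probability ratios against the reference play that role.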
Realizability: The assumption that the true function (reward or optimal policy) exists within the chosen family of models (e.g., neural networks of a certain size)
Model Mis-specification: The scenario where the true function (reward or policy) cannot be perfectly represented by the model class, leading to approximation errors
Online DPO: A variant of DPO where preference data is generated on-the-fly by the current policy rather than being fixed offline
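The online variant's training loop can be sketched as follows (all callables here—`policy_sample`, `judge`, `update`—are hypothetical placeholders for the current policy's sampler, a preference oracle, and a DPO gradient step):

```python
def online_dpo_step(policy_sample, judge, update, prompt):
    # Draw a fresh response pair from the CURRENT policy,
    # rather than reading from a fixed offline preference set
    y1, y2 = policy_sample(prompt), policy_sample(prompt)
    # Label the pair with a preference oracle (human or reward-model judge)
    chosen, rejected = (y1, y2) if judge(prompt, y1, y2) else (y2, y1)
    # Take one DPO gradient step on the freshly labeled pair
    update(prompt, chosen, rejected)
    return chosen, rejected
```

The key difference from offline DPO is only where the pairs come from: each step's data reflects the policy being trained.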
Isomorphic: In this context, meaning the reward model class and policy model class have equivalent representational capacity (one can be mapped to the other)
PILAF Sampler: A specific sampling strategy for Online DPO that mixes standard sampling with importance sampling based on reward differences to better approximate the objective
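The mixing idea can be illustrated with a toy sampler; the function name, mixing ratio, and exponential reward weighting below are illustrative assumptions, not the paper's exact scheme:

```python
import math
import random

def pilaf_style_sample(candidates, rewards, mix=0.5, beta=1.0):
    # Toy sketch of a mixed sampler: with probability `mix`, take a
    # standard (uniform) draw from the policy's candidate responses
    if random.random() < mix:
        return random.choice(candidates)
    # Otherwise, an importance-style draw: upweight candidates by
    # exp(beta * reward), so large reward differences are sampled
    # more often (assumed weighting for illustration)
    weights = [math.exp(beta * r) for r in rewards]
    return random.choices(candidates, weights=weights, k=1)[0]
```

The motivation is that pairs with informative reward gaps contribute more signal to the preference objective than pairs the policy already treats as near-ties.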