DPO: Direct Preference Optimization—an algorithm that fine-tunes language models directly on preference data, implicitly solving the same KL-regularized reward-maximization problem as RLHF without training an explicit reward model.
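For concreteness, a minimal sketch of the DPO objective in PyTorch. The function name `dpo_loss` and the convention that inputs are summed per-response log-probabilities are illustrative assumptions, not a reference implementation:

```python
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    # Log-probability margins between chosen (w) and rejected (l) responses;
    # each input is the summed log-probability of a full response.
    pi_logratio = pi_logps_w - pi_logps_l
    ref_logratio = ref_logps_w - ref_logps_l
    # DPO objective: -log sigmoid(beta * (policy margin - reference margin))
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()
```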
RLHF: Reinforcement Learning from Human Feedback—the standard two-stage pipeline of training a reward model and then optimizing a policy using PPO.
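The two stages correspond to two standard objectives, written out here for reference (sigma is the logistic function, r_phi the learned reward model, beta the KL coefficient):

```latex
% Stage 1: fit a reward model on preference pairs (Bradley-Terry)
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]

% Stage 2: KL-regularized policy optimization (e.g., with PPO)
\max_{\pi_\theta}\;\mathbb{E}_{x,\;y \sim \pi_\theta(\cdot\mid x)}
  \big[r_\phi(x, y)\big]
  - \beta\,\mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
```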
misspecified estimator: A statistical estimator that attempts to fit a model class that does not contain the true data-generating distribution.
implicit reward: The reward function implied by DPO's optimized policy, equal to beta times the log-ratio of the policy to the reference policy: r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)).
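In code, the implicit reward is just this scaled log-ratio (a sketch; `pi_logps` and `ref_logps` are assumed to be summed per-response log-probabilities):

```python
def implicit_reward(pi_logps, ref_logps, beta=0.1):
    # r = beta * log(pi / pi_ref) = beta * (log pi - log pi_ref)
    return beta * (pi_logps - ref_logps)
```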
AuxDPO: The proposed algorithm that augments the DPO loss with auxiliary variables, relaxing the constraint that the implicit reward lie exactly on the policy's tangent manifold.
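As a rough illustration only: one way auxiliary variables can relax that constraint is to let a learnable per-pair slack term enter the preference margin, with a penalty keeping it small. The names (aux_w, aux_l, lam) and the additive-slack-plus-L2-penalty form below are assumptions for the sketch, not the exact AuxDPO formulation:

```python
import torch.nn.functional as F

def aux_dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l,
                 aux_w, aux_l, beta=0.1, lam=1.0):
    # Standard DPO margin, confined to the policy's tangent manifold.
    margin = beta * ((pi_logps_w - ref_logps_w) - (pi_logps_l - ref_logps_l))
    # Auxiliary slack absorbs the reward component the restricted
    # policy class cannot represent (assumed additive form).
    pref_loss = -F.logsigmoid(margin + (aux_w - aux_l)).mean()
    # Penalty keeps the relaxation from trivializing the fit.
    penalty = lam * (aux_w.pow(2).mean() + aux_l.pow(2).mean())
    return pref_loss + penalty
```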
tabular policy: A theoretical policy class in which the probability of every action in every state can be set independently (infinite capacity), as assumed by the original DPO derivation.
parametric policy: A policy class (like a neural network) where probabilities are determined by a finite set of parameters θ, creating a restricted manifold of realizable policies.
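To make the tabular/parametric distinction concrete, a small illustrative sketch (the class names are assumptions, not from the paper): the tabular policy has one free logit per (state, action) pair, while the parametric policy ties all probabilities to a shared parameter vector theta:

```python
import torch
import torch.nn as nn

class TabularPolicy(nn.Module):
    # One free logit per (state, action): any action distribution
    # is realizable at every state independently of the others.
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_states, n_actions))

    def forward(self, state):
        return torch.log_softmax(self.logits[state], dim=-1)

class ParametricPolicy(nn.Module):
    # Probabilities are coupled through a finite theta (the network
    # weights), so only a restricted manifold of policies is reachable.
    def __init__(self, state_dim, n_actions, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state_features):
        return torch.log_softmax(self.net(state_features), dim=-1)
```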