DPO: Direct Preference Optimization—an algorithm that aligns a language model with preference data by optimizing the policy directly, treating the policy's log-probability ratios against a reference model as an implicit reward and thereby avoiding a separately trained reward model.
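As an illustration of the definition above, here is a minimal sketch of the DPO loss for a single preference pair, written with plain floats rather than tensors; the function name and the example log-probabilities are illustrative, not from the source.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (sketch).

    logp_w / logp_l are the total log-probabilities of the preferred
    and dispreferred completions under the policy; ref_logp_* are the
    same quantities under the frozen reference (SFT) model.
    """
    # Implicit reward margin: beta times the difference of the two
    # policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss falls as the policy raises the preferred completion's
# log-probability relative to the dispreferred one.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Raising `logp_w` (or lowering `logp_l`) increases the margin and strictly decreases the loss, which is the behavior the definition describes.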
DPOP: DPO-Positive—the proposed variation of DPO that adds a penalty term to the loss, discouraging the policy from reducing the probability of preferred completions below that of the reference model.
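A sketch of how the DPOP penalty modifies the DPO objective, under the assumption that the penalty term sits inside the sigmoid alongside the preference margin; the function name, the `lam` weight, and the exact placement are illustrative.

```python
import math

def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, lam=5.0):
    """DPOP loss for one preference pair (sketch).

    Extends the DPO margin with lam * max(0, ref_logp_w - logp_w),
    a penalty that activates only when the policy assigns the
    preferred completion *less* probability than the reference model.
    """
    penalty = lam * max(0.0, ref_logp_w - logp_w)
    margin = beta * ((logp_w - ref_logp_w)
                     - (logp_l - ref_logp_l)
                     - penalty)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy keeps the preferred completion at least as likely as the reference does, the penalty is zero and the loss reduces to plain DPO; once the preferred completion's probability drops below the reference, the penalty grows and the loss rises.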
SFT: Supervised Fine-Tuning—the initial phase of training a model on high-quality demonstration data before preference alignment.
Edit Distance: A measure of how dissimilar two strings are (e.g., the number of token changes needed to transform one into the other).
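The measure above is commonly computed as the Levenshtein distance via dynamic programming; a minimal sketch (the function name is illustrative) that works on strings or on token lists:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions needed to turn sequence a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x -> y
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # → 3
```

Passing lists of tokens instead of strings gives the token-level edit distance mentioned in the definition.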
Logits: The raw, unnormalized scores output by the final layer of the neural network before the softmax function converts them to probabilities.
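The softmax conversion mentioned in the definition can be sketched in a few lines; subtracting the maximum logit first is the standard numerical-stability trick:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution.

    Subtracting the max logit before exponentiating avoids overflow
    without changing the result.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The probabilities sum to 1 and preserve the ordering of the logits.
probs = softmax([2.0, 1.0, 0.1])
```

Note that softmax is shift-invariant: adding a constant to every logit leaves the probabilities unchanged, which is why only logit *differences* matter.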
RLHF: Reinforcement Learning from Human Feedback—a method to align models using a learned reward model and reinforcement learning algorithms like PPO.
Plackett-Luce model: A probabilistic model for ranking items; it generalizes the pairwise Bradley-Terry model and serves as the theoretical basis for the implicit reward in DPO.
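A sketch of the Plackett-Luce ranking probability, assuming the standard formulation in which each position is filled with probability proportional to the exponentiated score among the items not yet ranked; the function name is illustrative.

```python
import math

def plackett_luce_prob(scores_in_rank_order):
    """Probability of a full ranking under the Plackett-Luce model.

    scores_in_rank_order lists the items' scores from first place to
    last; position i is chosen with probability exp(score) divided by
    the sum of exp(score) over all items not yet ranked.
    """
    weights = [math.exp(s) for s in scores_in_rank_order]
    prob = 1.0
    for i in range(len(weights)):
        prob *= weights[i] / sum(weights[i:])
    return prob
```

With exactly two items this reduces to the Bradley-Terry pairwise probability, sigmoid of the score difference, which is the two-completion case used in DPO's derivation.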