RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences
DPO: Direct Preference Optimization—an offline alignment method that optimizes the policy directly on preference pairs, without fitting an explicit reward model
RLOO: REINFORCE Leave-One-Out—an online RL algorithm that samples multiple responses per prompt and estimates each response's advantage against the mean reward of the remaining samples
Minimax Regret: A decision rule that minimizes the maximum possible loss (regret) compared to the optimal action across all possible scenarios
Regret: The difference in reward between the optimal policy and the current policy for a given prompt
Informativeness: A metric used in this paper as a proxy for regret, calculated as the reward advantage of a prompt: the gap between the best achievable reward and that of the average (or worst) response
SimPO: Simple Preference Optimization—a reference-free alignment method that uses the average log-probability of a sequence as the implicit reward
SPPO: Self-Play Preference Optimization—an iterative self-play algorithm that updates the policy toward the Nash equilibrium of a preference game
ORPO: Odds Ratio Preference Optimization—a monolithic method that folds preference alignment into supervised fine-tuning via an odds-ratio penalty, with no separate reference model
SFT: Supervised Fine-Tuning—the initial training phase using labeled examples
Nash Equilibrium: A stable state in a game where no player can gain by unilaterally changing their strategy
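For the DPO entry above, the standard DPO objective (as published, not specific to this paper) may be a useful reference; here π_ref is the frozen reference policy, β a temperature, and (x, y_w, y_l) a prompt with preferred and dispreferred responses:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```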
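For the RLOO entry above: a minimal sketch of the leave-one-out advantage estimate, assuming k scalar rewards sampled for the same prompt (the function name and structure are illustrative, not from the paper):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: each sample's baseline is the mean
    reward of the other k-1 samples drawn for the same prompt."""
    k = len(rewards)
    total = sum(rewards)
    # r_i - (total - r_i) / (k - 1) simplifies to (k * r_i - total) / (k - 1)
    return [(k * r - total) / (k - 1) for r in rewards]
```

Because each baseline excludes the sample it scores, the advantages are unbiased and sum to zero across the k samples.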
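For the Minimax Regret entry above: a small sketch of the decision rule over a payoff table, where `payoffs[a][s]` is the payoff of action a under scenario s (the table and naming are hypothetical, for illustration only):

```python
def minimax_regret_choice(payoffs):
    """Return the index of the action minimizing worst-case regret.

    Regret of (a, s) is the best payoff achievable in scenario s
    minus the payoff action a actually earns in s.
    """
    scenarios = range(len(payoffs[0]))
    best_in = [max(row[s] for row in payoffs) for s in scenarios]
    worst_regret = [max(best_in[s] - row[s] for s in scenarios)
                    for row in payoffs]
    return worst_regret.index(min(worst_regret))
```

Note the contrast with maximin: an action with a mediocre best case can still win under minimax regret if it is never far from optimal in any scenario.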
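For the Regret and Informativeness entries above: a minimal sketch of the reward-advantage proxy, assuming a list of rewards for responses sampled on one prompt; the exact formula and baseline choice are the paper's, and this function is only an illustrative reading of the definition:

```python
def informativeness(rewards, baseline="mean"):
    """Reward advantage of a prompt: gap between the best sampled
    reward and a baseline (the mean or the worst sampled reward)."""
    ref = sum(rewards) / len(rewards) if baseline == "mean" else min(rewards)
    return max(rewards) - ref
```

A large gap suggests the current policy still leaves substantial reward on the table for that prompt, i.e., high regret; a near-zero gap suggests the prompt is uninformative for further training.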