Distortion: The worst-case ratio between the optimal average utility achievable (if full utility information were known) and the average utility achieved by a method using only ordinal comparisons
RLHF: Reinforcement Learning from Human Feedback—a method that fits a single reward model to pairwise comparisons and optimizes a policy against it
DPO: Direct Preference Optimization—an alignment method that optimizes policy likelihoods directly from preferences without an explicit reward modeling step
NLHF: Nash Learning from Human Feedback—an alignment method that finds the policy forming a Nash Equilibrium of a symmetric zero-sum game in which two policies compete to produce the preferred response
Bradley-Terry Model: A probabilistic model predicting the outcome of a pairwise comparison based on the difference in latent utility scores of the two options
Maximal Lotteries: A probabilistic voting rule that selects a distribution over candidates corresponding to the Nash Equilibrium of the symmetric zero-sum game defined by the pairwise margin matrix
Borda Count: A voting rule that scores each candidate by the total number of pairwise comparisons it wins, summed over all voters and all opponents; shown here to be equivalent to RLHF's reward-modeling objective
KL constraint: A restriction requiring the fine-tuned AI model's output distribution to stay within a bounded Kullback-Leibler divergence of the original pre-trained (reference) model's distribution
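To make the voting-theoretic entries concrete, here is a minimal sketch (not from the source) that computes a Bradley-Terry win probability, Borda-style scores, and the maximal lottery from a pairwise margin matrix. The 3-candidate cyclic example, the function names, and the use of numpy/scipy are illustrative assumptions, not part of the original text.

```python
# Illustrative sketch: Bradley-Terry probabilities, Borda scores, and the
# maximal lottery of a pairwise margin matrix. The example matrix below is
# a hypothetical Condorcet cycle; names and libraries are our own choices.
import numpy as np
from scipy.optimize import linprog


def bradley_terry_prob(u_i: float, u_j: float) -> float:
    """P(i beats j) under Bradley-Terry: logistic in the utility difference."""
    return 1.0 / (1.0 + np.exp(u_j - u_i))


def borda_scores(margin: np.ndarray) -> np.ndarray:
    """Net pairwise margin of each candidate, summed over all opponents."""
    return margin.sum(axis=1)


def maximal_lottery(margin: np.ndarray) -> np.ndarray:
    """Nash equilibrium of the symmetric zero-sum game with payoff `margin`.

    Solves the LP:  max v  s.t.  (p^T M)_j >= v for all j, sum(p) = 1, p >= 0.
    For a skew-symmetric margin matrix the game value v is 0 at equilibrium.
    """
    n = margin.shape[0]
    # Variables are [p_1, ..., p_n, v]; linprog minimizes, so minimize -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # Rewrite (M^T p)_j >= v as -(M^T p)_j + v <= 0, one row per column j.
    A_ub = np.hstack([-margin.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.array([[1.0] * n + [0.0]])   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]  # v is unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]


if __name__ == "__main__":
    # Hypothetical rock-paper-scissors cycle: a beats b, b beats c,
    # c beats a, each by the same margin.
    M = np.array([[0.0, 1.0, -1.0],
                  [-1.0, 0.0, 1.0],
                  [1.0, -1.0, 0.0]])
    print(borda_scores(M))     # all zeros: Borda cannot separate the cycle
    print(maximal_lottery(M))  # uniform lottery over the three candidates
```

On a Condorcet cycle like this one, Borda scores all tie while the maximal lottery randomizes uniformly, which is exactly the kind of case where the two rules behave differently.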