RLHF: Reinforcement Learning from Human Feedback, learning policies from preference comparisons rather than absolute reward signals
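A common formalization (assumed here; the glossary does not fix one) links comparisons to a latent reward via the Bradley-Terry model:

\[
\mathbb{P}(\tau_1 \succ \tau_2) \;=\; \sigma\big(r(\tau_1) - r(\tau_2)\big) \;=\; \frac{\exp(r(\tau_1))}{\exp(r(\tau_1)) + \exp(r(\tau_2))},
\]

where $r$ is the unobserved trajectory reward and $\sigma$ is the sigmoid; the learner only ever sees the binary comparison outcomes.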
Von Neumann winner: A (possibly randomized) policy that beats or ties every other policy in a head-to-head comparison with probability at least 1/2; it always exists, even when preferences are cyclic or otherwise non-transitive, making it the natural solution concept for general preferences
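One standard way to write the condition (our notation): letting $\mathbb{P}(\pi \succ \pi')$ denote the probability that a trajectory from $\pi$ is preferred to one from $\pi'$, a von Neumann winner is a distribution $p^*$ over policies satisfying

\[
\min_{\pi'} \; \mathbb{E}_{\pi \sim p^*}\big[\mathbb{P}(\pi \succ \pi')\big] \;\ge\; \tfrac{1}{2},
\]

and its existence follows from the minimax theorem applied to the symmetric zero-sum game with payoff $\mathbb{P}(\pi \succ \pi') - \tfrac{1}{2}$.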
Eluder Dimension: A complexity measure for function classes in sequential decision making; it quantifies how long a sequence of queries can remain informative, i.e., how many points an adversary can present on which functions that agreed on all earlier points can still disagree
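For concreteness, the usual (Russo and Van Roy) definition: a point $x$ is $\epsilon$-independent of $x_1, \dots, x_k$ with respect to class $\mathcal{F}$ if some $f, f' \in \mathcal{F}$ satisfy

\[
\sqrt{\sum_{i=1}^{k} \big(f(x_i) - f'(x_i)\big)^2} \;\le\; \epsilon
\qquad \text{yet} \qquad
\big|f(x) - f'(x)\big| \;>\; \epsilon,
\]

and the $\epsilon$-eluder dimension is the length of the longest sequence in which every point is $\epsilon$-independent of its predecessors.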
Adversarial MDP: An MDP setting where rewards are chosen by an adversary rather than being fixed, often solved using regret-minimization algorithms
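The standard objective in this setting (notation ours) is regret against the best fixed policy in hindsight over $K$ episodes:

\[
\mathrm{Regret}(K) \;=\; \max_{\pi} \sum_{k=1}^{K} V^{\pi}(r_k) \;-\; \sum_{k=1}^{K} V^{\pi_k}(r_k),
\]

where $r_k$ is the adversarially chosen reward function in episode $k$ and $\pi_k$ is the learner's policy for that episode.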
Restricted Nash Equilibrium: A Nash equilibrium computed within a restricted policy class, so each player best-responds only among policies in that class (here, policies mapping partial trajectories to actions)
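Spelled out for a two-player zero-sum game over a restricted class $\Pi$ (notation ours, with player 1 maximizing the value $V$): a pair $(\pi_1^*, \pi_2^*) \in \Pi \times \Pi$ such that

\[
V(\pi_1, \pi_2^*) \;\le\; V(\pi_1^*, \pi_2^*) \;\le\; V(\pi_1^*, \pi_2)
\qquad \text{for all } \pi_1, \pi_2 \in \Pi,
\]

so neither player can improve by deviating within $\Pi$, although an unrestricted deviation might still help.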
Plackett-Luce Model: A probabilistic model for ranking K items, generalizing the pairwise Bradley-Terry model
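Concretely, with a score $s_i$ per item, the model assigns the ranking $\sigma(1) \succ \sigma(2) \succ \dots \succ \sigma(K)$ probability

\[
\mathbb{P}(\sigma) \;=\; \prod_{k=1}^{K} \frac{\exp\big(s_{\sigma(k)}\big)}{\sum_{j=k}^{K} \exp\big(s_{\sigma(j)}\big)},
\]

i.e., items are picked for successive ranks without replacement, in proportion to their exponentiated scores; at $K = 2$ this is exactly the Bradley-Terry model.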
P2R: Preference-to-Reward interface, the proposed algorithm that constructs confidence intervals for rewards from comparison feedback, letting a standard RL algorithm run on the resulting reward estimates
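A minimal sketch of such an interface in Python, assuming a Bradley-Terry comparator and a fixed baseline trajectory (the class, its method names, and the fixed query budget are all illustrative, not the paper's actual algorithm):

```python
import math
import random

class PreferenceToReward:
    """Toy P2R-style interface: answers reward queries for an RL algorithm
    by spending comparison queries against a fixed baseline trajectory."""

    def __init__(self, baseline, n_queries=500):
        self.baseline = baseline    # anchor trajectory all others are compared to
        self.n_queries = n_queries  # comparison budget per new trajectory
        self.cache = {}             # trajectory -> reward estimate (trajectories hashable)

    def reward(self, trajectory, compare):
        """`compare(a, b)` is the preference oracle: 1 if a is preferred to b."""
        if trajectory not in self.cache:
            wins = sum(compare(trajectory, self.baseline)
                       for _ in range(self.n_queries))
            p = min(max(wins / self.n_queries, 1e-3), 1 - 1e-3)  # clip away from 0/1
            # Invert the Bradley-Terry link P(a > b) = sigma(r_a - r_b):
            self.cache[trajectory] = math.log(p / (1 - p))
        return self.cache[trajectory]

if __name__ == "__main__":
    true_r = {"good": 1.0, "base": 0.0, "bad": -1.0}  # hidden rewards
    def compare(a, b):  # simulated Bradley-Terry human
        p = 1.0 / (1.0 + math.exp(-(true_r[a] - true_r[b])))
        return 1 if random.random() < p else 0
    p2r = PreferenceToReward(baseline="base")
    print(p2r.reward("good", compare), p2r.reward("bad", compare))  # about 1.0 and -1.0
```

The real algorithm is adaptive, querying the comparator only when its current reward confidence interval is too wide; the one-shot budget and cache above are a crude stand-in for that mechanism.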
OMLE: Optimistic Maximum Likelihood Estimation, a model-based RL algorithm that acts optimistically with respect to a confidence set of statistically plausible models (those whose likelihood is close to maximal)
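A standard way to state the optimism step (notation ours): after observing data $\mathcal{D}_t$, keep every model whose log-likelihood is within a slack $\beta$ of the maximum,

\[
\mathcal{M}_t \;=\; \Big\{\theta \;:\; \mathcal{L}_{\mathcal{D}_t}(\theta) \;\ge\; \max_{\theta'} \mathcal{L}_{\mathcal{D}_t}(\theta') - \beta \Big\},
\]

and then act with the optimistic policy $\pi_t = \arg\max_{\pi} \max_{\theta \in \mathcal{M}_t} V_{\theta}^{\pi}$, where $V_{\theta}^{\pi}$ is the value of $\pi$ under model $\theta$.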