RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using human preference labels rather than explicit scalar rewards.
Bradley-Terry (BT) Model: A statistical model predicting the probability that one item is preferred over another based on the difference in their underlying latent rewards, typically using a sigmoid function.
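A minimal sketch of the BT preference probability, assuming scalar latent rewards for the two items (function and variable names here are illustrative, not from the source):

```python
import math

def bt_preference_prob(r_a: float, r_b: float) -> float:
    """P(a preferred over b) under Bradley-Terry: sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Equal latent rewards give a 50/50 preference.
assert bt_preference_prob(1.0, 1.0) == 0.5
```

Note that only the difference r_a - r_b matters: shifting both rewards by a constant leaves the preference probability unchanged.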
Sample Complexity: The number of training samples required for an algorithm to learn a near-optimal policy.
Regret: The difference in accumulated reward between the algorithm's policy and the optimal policy over time.
R_max: The maximum possible value (range) of the underlying reward function. The sample complexity of prior methods scaled exponentially in this value.
Exploration Bonus: An extra term added to the objective function to encourage the model to visit uncertain or under-explored regions of the state/action space.
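A common instantiation is a count-based bonus that shrinks as a state-action pair is visited more often; this is an illustrative sketch, not the specific bonus used in any particular method:

```python
import math

def exploration_bonus(visit_count: int, c: float = 1.0) -> float:
    """Count-based bonus: large for under-explored state-action pairs,
    decaying as 1/sqrt(n) with the visit count n."""
    return c / math.sqrt(max(visit_count, 1))

# Rarely visited pairs receive a larger bonus, steering the policy toward them.
assert exploration_bonus(1) > exploration_bonus(100)
```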
DPO: Direct Preference Optimization—a method to optimize policies directly from preferences without explicitly training a separate reward model.
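The DPO objective for a single preference pair can be sketched as follows, assuming the (reference-)policy log-probabilities of the chosen and rejected responses are already computed; argument names and the default beta are illustrative:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: negative log-sigmoid of the implicit reward margin,
    where the implicit reward is beta times the policy/reference log-ratio."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy equals the reference, the margin is 0 and the loss is log(2).
```

Minimizing this loss pushes the policy to raise the log-ratio of chosen over rejected responses, without ever fitting an explicit reward model.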
MLE: Maximum Likelihood Estimation—a method for estimating parameters (here, the reward function) by maximizing the probability of the observed data.
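For intuition, MLE of a Bradley-Terry reward gap from binary preference labels can be done by gradient ascent on the log-likelihood; this toy sketch (two items, illustrative names and hyperparameters) is not the estimator from the source:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fit_bt_reward_gap(prefs, lr: float = 0.1, steps: int = 2000) -> float:
    """MLE of the reward gap r_a - r_b from labels y in {0, 1}
    (y = 1 means a was preferred) under the Bradley-Terry model."""
    gap = 0.0
    for _ in range(steps):
        # Average gradient of the BT log-likelihood w.r.t. the gap.
        grad = sum(y - sigmoid(gap) for y in prefs) / len(prefs)
        gap += lr * grad
    return gap

# With 3 of 4 labels preferring a, the MLE satisfies sigmoid(gap) = 0.75,
# i.e. gap = log(3).
gap = fit_bt_reward_gap([1, 1, 1, 0])
```

The fitted gap matches the empirical preference frequency, which is exactly the maximum-likelihood condition for this model.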