RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using a reward model trained on human preferences
Overoptimization: The phenomenon in which optimizing a policy against an imperfect proxy reward model eventually degrades performance as measured by the true reward (an instance of Goodhart's Law)
Proxy Reward Model: A neural network trained to approximate human preferences (or a 'gold' standard) from a limited amount of preference data
Gold Reward Model: In this synthetic setup, a large, fixed model acting as the ground-truth 'human' evaluator
BoN: Best-of-N sampling—generating N responses and selecting the one with the highest reward score
PPO: Proximal Policy Optimization—an RL algorithm that updates the policy to maximize reward while limiting how much the policy changes at each step
WCO: Worst-Case Optimization—using the minimum score from an ensemble of reward models as the training signal
UWO: Uncertainty-Weighted Optimization—using the mean score minus a weighted variance term from an ensemble as the training signal
KL divergence: A measure of difference between two probability distributions, used here to measure how far the optimized policy has drifted from the initial model
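The BoN, WCO, and UWO objectives above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the ensemble size, the variance weight `lam`, and the toy reward function in the usage example are assumptions chosen for clarity.

```python
import numpy as np


def best_of_n(samples, reward_fn):
    """BoN: score N candidate responses and return the highest-scoring one."""
    scores = [reward_fn(s) for s in samples]
    return samples[int(np.argmax(scores))]


def wco(ensemble_scores):
    """WCO: the training signal is the minimum score across the ensemble.

    ensemble_scores: array of shape (k, n) -- k reward models, n samples.
    """
    return ensemble_scores.min(axis=0)


def uwo(ensemble_scores, lam=0.5):
    """UWO: mean score minus a weighted variance penalty across the ensemble.

    lam is an illustrative coefficient; in practice it is a tuned hyperparameter.
    """
    return ensemble_scores.mean(axis=0) - lam * ensemble_scores.var(axis=0)
```

For example, with two reward models scoring two samples as `[[1.0, 2.0], [3.0, 0.0]]`, WCO yields `[1.0, 0.0]` while UWO with `lam=0.5` yields `[1.5, 0.5]`: the second sample is penalized because the ensemble disagrees about it, which is exactly the uncertainty signal both objectives exploit.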