Reward Hacking: A phenomenon where a policy exploits errors or spurious correlations in a reward model to achieve a high score without actually satisfying the human user's intent.
Underspecification: When a machine learning pipeline can produce many models that all perform well on in-distribution data but whose behavior varies significantly on out-of-distribution data (such as policy-generated text).
BoN: Best-of-N Reranking—an inference strategy where N samples are generated and the one with the highest reward model score is selected.
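Best-of-N reranking can be sketched in a few lines. This is a hedged illustration, not code from any specific library: `generate` and `reward_model` are hypothetical placeholders for a sampler and a scorer.

```python
import itertools

def best_of_n(prompt, generate, reward_model, n=4):
    """Draw n candidate responses and return the one the reward model
    scores highest for this prompt."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: reward_model(prompt, resp))

# Toy demo: the "generator" cycles through fixed strings and the
# "reward model" simply scores responses by length.
_pool = itertools.cycle(["ok", "a longer reply", "fine", "meh"])
pick = best_of_n("q", lambda p: next(_pool), lambda p, r: len(r), n=4)
```

Note that BoN never updates the policy; it only spends extra inference-time compute, which is why a flawed reward model can still be exploited by the selection step.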
RLHF: Reinforcement Learning from Human Feedback—a method to tune language models using a reward model trained on human preferences.
Bradley-Terry Model: A probability model used to predict the outcome of a pairwise comparison (e.g., which of two responses is better) based on a latent reward score.
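Concretely, the Bradley-Terry model says the probability that response a beats response b is the sigmoid of their reward difference. A minimal sketch:

```python
import math

def bt_prob(r_a, r_b):
    """Bradley-Terry: P(a preferred over b) = sigmoid(r_a - r_b),
    where r_a and r_b are latent reward scores."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))
```

Equal rewards give probability 0.5, and the two orderings always sum to 1; reward models for RLHF are typically trained by maximizing the log-likelihood of human preference labels under this model.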
PPO: Proximal Policy Optimization—an RL algorithm used to update the policy model to maximize reward while limiting the update step size.
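The "limiting the update step size" part of PPO comes from its clipped surrogate objective. A per-sample sketch (illustrative only, assuming the probability ratio and advantage are already computed):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    Clipping removes the incentive to move the policy far from the
    old policy in a single update."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds 1 + eps, so there is no gradient incentive to push the update further.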
Pretrain Ensembles: Ensembles of models where each member was pretrained on the same data but with different random seeds (affecting data order and initialization).
Finetune Ensembles: Ensembles where members share the same pretrained base but are finetuned with different random seeds.
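Either kind of ensemble is used the same way at scoring time: query every member and aggregate. A minimal sketch (the member reward models are hypothetical callables; disagreement across members is one signal that the reward is underspecified for an input):

```python
import statistics

def ensemble_score(prompt, response, reward_models):
    """Score a response with each ensemble member and return the mean
    score plus the members' standard deviation as a disagreement signal."""
    scores = [rm(prompt, response) for rm in reward_models]
    return statistics.mean(scores), statistics.stdev(scores)

# Toy demo with three constant "reward models".
mean, spread = ensemble_score("q", "r", [lambda p, r: 1.0,
                                         lambda p, r: 2.0,
                                         lambda p, r: 3.0])
```

Pretrain ensembles are the more diverse (and more expensive) of the two, since members differ from the very first gradient step rather than only during finetuning.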
KL Divergence: Kullback-Leibler divergence—a statistical distance measuring how one probability distribution differs from a reference distribution.
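For discrete distributions given as probability vectors, the divergence is a single sum. A minimal sketch:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete
    distributions; zero-probability terms in P contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

KL(P || P) is zero and the divergence is otherwise positive; in RLHF it is commonly used as a penalty that keeps the updated policy close to the reference (pretrained or SFT) model.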