Reward Overoptimization: A phenomenon where optimizing a policy against a proxy reward model eventually leads to a decrease in the true reward (ground truth performance) as the policy exploits the proxy's flaws.
Best-of-N (BoN): An inference-time method where N solutions are generated, and the one with the highest reward model score is selected.
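The selection step can be sketched in a few lines. This is a minimal illustration, not any specific system's implementation; `toy_generate` and `toy_rm` are hypothetical stand-ins for an LLM sampler and a trained reward model:

```python
from itertools import cycle

def best_of_n(generate, reward_model, prompt, n=4):
    # Sample n candidate solutions, score each with the (proxy) reward
    # model, and return the highest-scoring candidate.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

# Toy stand-ins for illustration only (hypothetical): a real system
# would sample from an LLM and score with a learned reward model.
_pool = cycle(["2+2=5", "2+2=4", "2+2=3"])

def toy_generate(prompt):
    return next(_pool)

def toy_rm(prompt, response):
    return 1.0 if response.endswith("4") else 0.0

best_of_n(toy_generate, toy_rm, "What is 2+2?", n=3)  # → "2+2=4"
```

Note that BoN optimizes against the proxy reward model at inference time, so with large N it is subject to the same overoptimization risk described above.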
Reward Hacking: When a policy generates outputs that score highly under the reward model but are poor or incorrect by human judgment.
Process Reward Model (PRM): A reward model that scores each intermediate step of a reasoning chain rather than just the final answer.
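Because a PRM scores steps rather than whole solutions, the per-step scores must be aggregated to rank complete solutions. A minimal sketch, assuming minimum-over-steps as the aggregation rule (one common choice; product is another):

```python
def prm_solution_score(step_scores):
    # step_scores: a PRM's score for each intermediate reasoning step.
    # Taking the minimum penalizes a chain for its weakest step, so one
    # bad step sinks the whole solution.
    return min(step_scores)

prm_solution_score([0.9, 0.8, 0.3, 0.95])  # → 0.3
```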
Mean Reciprocal Rank (MRR): A ranking metric used here to evaluate how highly the correct solution is ranked among incorrect ones; the reciprocal rank for a single problem is 1/rank of the correct solution, and MRR is the mean of these values across problems.
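The metric is straightforward to compute; a short sketch:

```python
def mean_reciprocal_rank(ranks):
    # ranks: 1-based rank of the correct solution on each problem.
    # The reciprocal rank of one problem is 1/rank; MRR averages it
    # over all problems.
    return sum(1.0 / r for r in ranks) / len(ranks)

# Correct solution ranked 1st, 2nd, and 4th on three problems:
mean_reciprocal_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3 ≈ 0.583
```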
Generative Reward Model: Using an LLM to evaluate responses, either by direct scoring or by pairwise ranking prompts (LLM-as-a-judge).
PPO: Proximal Policy Optimization—an RL algorithm used to fine-tune policies using signals from the reward model.
Bradley-Terry (BT) model: A statistical model for estimating the probability that one item is preferred over another in pairwise comparisons.
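Under the BT model, each item has a latent score and the preference probability is a logistic function of the score difference. A minimal sketch:

```python
import math

def bt_preference_prob(score_a, score_b):
    # Bradley-Terry: P(A preferred over B) = sigmoid(s_A - s_B),
    # equivalently exp(s_A) / (exp(s_A) + exp(s_B)).
    return 1.0 / (1.0 + math.exp(score_b - score_a))

bt_preference_prob(2.0, 2.0)  # → 0.5 when scores are equal
```

This is the objective reward models are typically trained under: the model's scalar outputs play the role of the BT scores, fit so that preferred responses get higher scores than rejected ones.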