Reward Hacking: When a model exploits a flaw in the reward function (e.g., inserting specific words that earn a high score) rather than learning the intended task
Goodhart's Law: The principle that when a measure becomes a target, it ceases to be a good measure
LLM-as-a-judge: Using a Large Language Model to evaluate the output of another model, often via prompting
DPO: Direct Preference Optimization—an algorithm that optimizes language models to match preferences without an explicit reward model
PPO: Proximal Policy Optimization—an RL algorithm commonly used to train language models using a reward model
Spurious Correlations: Patterns a model learns that correlate with labels but are not causally related to the task (e.g., length bias, where longer answers are scored as better)
Meta-evaluation: The process of evaluating the evaluators (metrics or reward models) by correlating their scores with human judgments
Exposure Bias: The mismatch between training (where models see ground truth) and testing (where models generate their own history)
CometKiwi: A reference-free learned evaluation metric for machine translation, built on the InfoXLM encoder
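The meta-evaluation entry above describes correlating a metric's scores with human judgments. As a minimal sketch (with made-up scores, not real evaluation data), the following computes Spearman's rank correlation between a metric and human ratings from scratch, assuming no tied values:

```python
def ranks(xs):
    """Return 1-based ranks of xs (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Hypothetical data: a metric's scores vs. human ratings for five outputs
metric_scores = [0.2, 0.5, 0.7, 0.4, 0.9]
human_scores = [1, 2, 4, 3, 5]
print(spearman(metric_scores, human_scores))  # high (~0.9): metric tracks human judgments
```

In practice one would use `scipy.stats.spearmanr` (which also handles ties) over a large set of human-annotated outputs; a metric with low correlation here is a poor target for optimization, per Goodhart's Law above.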