RLHF: Reinforcement Learning from Human Feedback—aligning models using rewards derived from human preferences
Reward Hacking: When a model exploits flaws in a reward function to achieve high scores without actually meeting the user's intent
Energy Loss: Defined in this paper as the L1-norm of the difference between the input and output hidden states of the final transformer layer
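The definition above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the array shapes and function name are assumptions, and the hidden states would in practice come from the final transformer layer of the model.

```python
import numpy as np

def energy_loss(h_in: np.ndarray, h_out: np.ndarray) -> np.ndarray:
    """L1-norm of the difference between the final layer's input and
    output hidden states, computed per token position.

    h_in, h_out: arrays of shape (seq_len, hidden_dim)."""
    return np.abs(h_in - h_out).sum(axis=-1)

# Toy example: one token position with hidden_dim = 2.
h_in = np.array([[1.0, 2.0]])
h_out = np.array([[0.0, 0.0]])
print(energy_loss(h_in, h_out))  # [3.] — |1-0| + |2-0|
```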
EPPO: Energy loss-aware PPO—the authors' proposed algorithm that penalizes increases in energy loss during RL
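A rough sketch of the shaping idea behind EPPO, as described by this entry: subtract a penalty when the policy's energy loss grows beyond a reference value. The function name, the use of the SFT model as the reference, and the coefficient `beta` are illustrative assumptions, not the paper's exact formulation.

```python
def eppo_shaped_reward(reward: float,
                       energy_loss_policy: float,
                       energy_loss_ref: float,
                       beta: float = 0.1) -> float:
    # Hypothetical sketch: penalize only the *increase* in energy loss
    # of the policy relative to a reference (e.g. the SFT model).
    increase = max(0.0, energy_loss_policy - energy_loss_ref)
    return reward - beta * increase

# Policy's energy loss exceeds the reference by 2.0, so the
# reward is reduced by beta * 2.0.
print(eppo_shaped_reward(1.0, 5.0, 3.0, beta=0.5))  # 0.0
```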
SFT: Supervised Fine-Tuning—the initial training phase using ground-truth labels before RLHF
PPO: Proximal Policy Optimization—a standard RL algorithm used to update the language model policy
ODIN: A reward modeling method cited as a baseline for mitigating reward hacking
InfoRM: An information-theoretic reward modeling approach used as a baseline and analysis tool
L1-norm: The sum of the absolute values of a vector's components, used here to measure the magnitude of hidden state changes
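For concreteness, the L1-norm of a vector is just the sum of absolute values of its components:

```python
import numpy as np

v = np.array([3.0, -4.0, 1.0])
l1 = np.abs(v).sum()
print(l1)  # 8.0 — |3| + |-4| + |1|
```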