InfoRM: Information-Theoretic Reward Modeling—the proposed framework that trains a reward model using a variational Information Bottleneck objective to filter irrelevant features
IBL: Information Bottleneck Latent regularization—a penalty term added to the RL objective that discourages the policy from producing responses that are outliers in the InfoRM latent space
MOP: Mahalanobis Outlier Probability—a metric defined as the proportion of RLHF samples whose latent representations are flagged as outliers (via Mahalanobis distance) relative to the distribution of SFT samples, used to quantify the severity of reward hacking
Mahalanobis Distance: A distance measure that accounts for correlations between variables in a dataset; used here to determine how far a response's latent representation is from the distribution of 'normal' SFT responses
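The two entries above can be illustrated together. The sketch below is not the paper's implementation; it simply fits a Gaussian to some SFT latent vectors, computes each RLHF latent's Mahalanobis distance to that distribution, and reports the fraction exceeding a threshold (the threshold value here is an arbitrary assumption):

```python
import numpy as np

def mahalanobis_outlier_probability(sft_latents, rlhf_latents, threshold=3.0):
    """Illustrative stand-in for MOP: fraction of RLHF latents whose
    Mahalanobis distance from the SFT latent distribution exceeds a
    threshold. Not the paper's exact procedure."""
    mu = sft_latents.mean(axis=0)
    cov = np.cov(sft_latents, rowvar=False)
    cov_inv = np.linalg.inv(cov)
    diff = rlhf_latents - mu
    # Mahalanobis distance: sqrt((x - mu)^T Sigma^{-1} (x - mu))
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
    return float(np.mean(d > threshold))
```

Because the covariance term whitens correlated dimensions, a point can be far in Euclidean terms yet unremarkable under Mahalanobis distance if it lies along a high-variance direction of the SFT data.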
Reward Hacking: A phenomenon where the policy model exploits flaws in the reward model to get high scores without actually improving performance (also called reward overoptimization)
SFT: Supervised Fine-Tuning—the initial phase of training where the model learns to follow instructions from human demonstrations
KL Divergence: Kullback-Leibler Divergence—an asymmetric measure of how one probability distribution differs from another; used in standard RLHF to penalize the policy for drifting too far from the SFT model's output distribution
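For discrete distributions, KL divergence is the expectation under p of the log-ratio of the two distributions. A minimal sketch (a toy stand-in for the per-token RLHF penalty, with p playing the policy and q the SFT model):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.
    Note the asymmetry: KL(p || q) != KL(q || p) in general."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

The divergence is zero exactly when the two distributions agree, which is why it works as a drift penalty: it only activates once the policy's token distribution departs from the SFT model's.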
PPO: Proximal Policy Optimization—the standard reinforcement learning algorithm used to update the policy model
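The core of PPO is its clipped surrogate objective, which caps how much a single update can exploit a large probability ratio between the new and old policy. A per-sample sketch (the epsilon value of 0.2 is the commonly used default, not something specified by this paper):

```python
import numpy as np

def ppo_clipped_objective(logprob_new, logprob_old, advantage, eps=0.2):
    """Clipped surrogate: L = min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r = pi_new(a|s) / pi_old(a|s) is the probability ratio."""
    ratio = np.exp(logprob_new - logprob_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the minimum makes the bound pessimistic: the policy gains
    # nothing from pushing the ratio beyond the clipping range.
    return np.minimum(ratio * advantage, clipped * advantage)
```

When the ratio is 1 the objective reduces to the advantage itself; when the ratio grows far beyond 1 + eps with a positive advantage, the gain is capped at (1 + eps) * advantage.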
Information Bottleneck: A technique that seeks to find the most compact representation of the input (bottleneck) that still preserves the information necessary to predict the output
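In its standard formulation, the Information Bottleneck trades off compression of the input X into a latent Z against preservation of information about the target Y (in this setting, roughly, the prompt-response pair and the preference signal; the exact variational objective used by InfoRM may differ):

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

Here I(·;·) denotes mutual information and β > 0 controls the trade-off: larger β favors predictive power over compression, while smaller β squeezes more irrelevant input detail out of the latent representation.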