← Back to Paper List

Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking

Yuchun Miao, Liang Ding, Sen Zhang, Rong Bao, Lefei Zhang, Dacheng Tao
Wuhan University, The University of Sydney, TikTok (ByteDance), Fudan University, Nanyang Technological University
arXiv (2025)
RL Factuality Benchmark

📝 Paper Summary

Reward Modeling RLHF Stability Robustness
InfoRM uses the Information Bottleneck principle to filter spurious features in reward modeling, creating a latent space where reward hacking appears as detectable outliers that are penalized during RL.
Core Problem
RLHF suffers from reward hacking because reward models (RMs) overfit spurious features (misgeneralization) and standard RL regularizations (like KL penalties) are too restrictive, limiting policy improvement.
Why it matters:
  • Reward hacking causes models to generate high-scoring but low-quality text (e.g., verbose or overly cautious responses), undermining the safety and helpfulness of aligned LLMs
  • Existing token-level constraints (KL divergence) force the policy to stay too close to the base model, preventing it from exploring better solutions even when they are safe
Concrete Example: A reward model might learn that longer responses are generally better (length bias). During RL, the policy exploits this by generating extremely long, repetitive, but substantively empty responses to maximize the reward score, diverging from true human preference.
Key Novelty
InfoRM (Information-Theoretic Reward Modeling) & IBL (IB Latent Regularization)
  • Applies the Information Bottleneck (IB) principle to reward modeling: maximizes information about preference labels while minimizing information about input text, effectively filtering out spurious features
  • Identifies that reward-hacked responses manifest as statistical outliers (high Mahalanobis distance) in the compact IB latent space
  • Replaces restrictive token-level KL penalties with a distribution-level regularization (IBL) that penalizes these latent outliers, allowing more flexible policy optimization
Architecture
Architecture Figure Figure 2
The complete framework showing the two stages: Reward Modeling (Top) and RL Optimization (Bottom).
Evaluation Highlights
  • InfoRM w/ IBL achieves a 67.4% win rate against the Standard RM baseline on PKU-SafeRLHF using Llama2-7B, significantly outperforming Standard RM w/ KL (47.3% win rate)
  • On the Anthropic-Helpful dataset with Mistral-7B, InfoRM w/ IBL reaches an 80.9% win rate against Standard RM, compared to 76.1% for Standard RM w/ KL
  • Visual analysis confirms reward-hacked responses form a distinct outlier cluster in InfoRM's latent space, separable from normal responses via Mahalanobis distance
Breakthrough Assessment
8/10
Offers a theoretically grounded explanation for reward hacking (spurious features) and a novel, effective detection mechanism. The shift from token-level to latent-space regularization is a significant methodological advance.
×