← Back to Paper List

ODIN: Disentangled Reward Mitigates Hacking in RLHF

Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro
University of Maryland, College Park, NVIDIA
arXiv (2024)
RL Benchmark Factuality

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) Reward Modeling AI Alignment
ODIN mitigates reward hacking by training a reward model with separate heads for content quality and length, then discarding the length head during reinforcement learning to prevent verbosity.
Core Problem
RLHF policies often exploit reward models by generating excessively long, verbose responses ('length hacking') to maximize scores without improving actual quality.
Why it matters:
  • Reward hacking leads to 'over-optimization' where models achieve high reward scores but fail to satisfy user intent
  • Verbosity bias in human and model evaluation creates a feedback loop that degrades model utility and efficiency
  • Existing fixes like length penalties are brittle and require extensive hyperparameter tuning that is difficult to generalize
Concrete Example: A well-formatted but verbose and less helpful response often receives a higher score from a standard reward model than a concise, correct answer, causing the policy to learn to output filler text.
Key Novelty
ODIN (Omitted Disentangled INformation)
  • Modifies the reward model to have two output heads sharing the same backbone features: one for 'Quality' and one for 'Length'
  • Trains these heads with specific losses (Pearson correlation penalty and orthogonality loss) to force the Quality head to be independent of token count
  • Discards the Length head during the RL phase, using only the disentangled Quality signal to guide the policy
Architecture
Architecture Figure Figure 1
Overview of the ODIN framework comparing RM Training and RL Finetuning stages.
Evaluation Highlights
  • Reduces reward model's Pearson correlation with length from 0.451 (Baseline) to -0.03 (ODIN), effectively eliminating length bias
  • Achieves a superior Pareto front for Win Score vs. Length compared to PPO and ReMax baselines (with and without length penalties)
  • Maintains or improves performance on standard benchmarks (MMLU, TruthfulQA) while controlling verbosity
Breakthrough Assessment
7/10
Offers a principled, architectural solution to a pervasive RLHF problem (length hacking) that works better than heuristic fixes. Verification is thorough, though the method is a specific refinement of RM training rather than a paradigm shift.
×