Post-hoc Reward Calibration: A Case Study on Length Bias

Zeyu Huang, Zihan Qiu, Zili Wang, E. Ponti, Ivan Titov
University of Edinburgh, Alibaba Group, University of Amsterdam
International Conference on Learning Representations (2024)
RL Benchmark

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) Reward Modeling
Post-hoc Reward Calibration estimates and subtracts bias terms (such as length preference) from reward model scores using locally weighted regression, without retraining the model.
Core Problem
Reward Models (RMs) often learn spurious correlations, such as favoring longer responses regardless of quality, which leads to 'reward hacking' during RLHF and inaccurate rankings during evaluation.
Why it matters:
  • Biased RMs cause LLMs to generate verbose but low-quality content (reward hacking) during alignment
  • When used as judges, biased RMs (including GPT-4) produce misleading rankings that favor length over substance
  • Existing mitigation strategies usually require expensive retraining, additional data collection, or modifying the RL algorithm itself
Concrete Example: An RM might assign a higher score to a verbose, rambling answer than to a concise, correct one simply because the training data contained longer preferred responses. This causes the aligned LLM to learn that 'longer is better' rather than 'correct is better'.
Key Novelty
Training-free Post-hoc Calibration via Locally Weighted Regression
  • Decomposes the observed reward into a 'true quality' term and a 'bias' term dependent on a specific characteristic (e.g., length)
  • Uses the local average of rewards across the dataset (via Locally Weighted Regression) to approximate the bias curve
  • Subtracts this estimated bias from the original reward scores to recover a calibrated signal, all without updating RM weights
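The three steps above can be sketched with a simple kernel-weighted regression. This is a minimal, dependency-free illustration, not the authors' implementation; the function names and the Gaussian-kernel bandwidth are illustrative assumptions:

```python
import math

def estimate_bias(lengths, rewards, bandwidth=100.0):
    """Approximate the length-bias curve as a locally weighted
    (Gaussian-kernel) average of reward vs. response length.
    Averaging over many samples keeps the systematic length
    preference while per-sample quality differences wash out."""
    def bias_at(x):
        # Gaussian kernel weight for every sample, centered at length x
        weights = [math.exp(-((x - l) ** 2) / (2 * bandwidth ** 2))
                   for l in lengths]
        total = sum(weights)
        # Kernel-weighted mean reward at this length
        return sum(w * r for w, r in zip(weights, rewards)) / total
    return [bias_at(l) for l in lengths]

def calibrate(lengths, rewards, bandwidth=100.0):
    """Subtract the estimated bias curve from the raw RM scores
    to recover a calibrated quality signal (no RM weight updates)."""
    bias = estimate_bias(lengths, rewards, bandwidth)
    return [r - b for r, b in zip(rewards, bias)]

# Toy data where reward grows purely with length (pure bias):
lengths = [100, 200, 300, 400]
rewards = [1.0, 2.0, 3.0, 4.0]
calibrated = calibrate(lengths, rewards)
# After subtraction, the spread attributable to length shrinks.
```

Since the whole procedure is a single pass of weighted averaging over the score list, its cost scales linearly with dataset size, which is consistent with the efficiency figures reported below.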
Evaluation Highlights
  • Achieves a 3.11-point average performance gain across 33 different Reward Models on the RewardBench benchmark
  • Improves ranking correlation with GPT-4 and human preferences for 8 open-source RMs evaluating 184 LLMs on AlpacaEval
  • Calibrating over 300,000 samples takes only 30 seconds on a single CPU, demonstrating high computational efficiency
Breakthrough Assessment
7/10
Offers a highly practical, low-cost solution to a pervasive problem (length bias) in RLHF. While the method (regression) is standard, applying it post-hoc to RMs is a valuable operational improvement.