Evaluation Setup
Post-hoc calibration of pre-trained Reward Models
Benchmarks:
- RewardBench (reward model evaluation)
- AlpacaEval (LLM generation evaluation via simulated chat)
Metrics:
- RewardBench Score
- Length-Controlled Win Rate (LC-Win Rate)
- Ranking Correlation (Kendall's Tau/Spearman)
- Statistical methodology: Not explicitly reported in the paper
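To make the ranking-correlation metrics above concrete, here is a small pure-Python sketch of Kendall's tau and Spearman's rho comparing a reward model's scores against a reference judge's scores. The score lists are made-up placeholders, not values from the paper; in practice one would use `scipy.stats.kendalltau` / `spearmanr`.

```python
from itertools import combinations

def kendall_tau(a, b):
    """(Concordant - discordant) pairs divided by total pairs (no ties assumed)."""
    n = len(a)
    s = sum(1 if (a[i] - a[j]) * (b[i] - b[j]) > 0 else -1
            for i, j in combinations(range(n), 2))
    return s / (n * (n - 1) / 2)

def spearman_rho(a, b):
    """1 - 6*sum(d^2) / (n*(n^2-1)) over rank differences (no ties assumed)."""
    def ranks(x):
        order = sorted(range(len(x)), key=lambda i: x[i])
        r = [0] * len(x)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for five responses (illustrative only):
rm_scores  = [0.9, 0.4, 0.7, 0.1, 0.5]   # reward-model scores
ref_scores = [0.8, 0.3, 0.9, 0.2, 0.4]   # reference judge (e.g. GPT-4) scores

print(kendall_tau(rm_scores, ref_scores))   # 0.8
print(spearman_rho(rm_scores, ref_scores))  # 0.9
```

Both metrics depend only on the ordering of scores, not their magnitudes, which is why they suit comparing a calibrated RM's ranking against GPT-4 or human rankings.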
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Internal Profiling | Time to calibrate 300k samples | Not reported in the paper | 30 seconds | Not reported in the paper |

The method demonstrates efficiency gains compared to retraining-based approaches.
Main Takeaways
- Consistent performance gains (avg +3.11) across a diverse set of 33 Reward Models on RewardBench, indicating broad applicability.
- Calibration aligns RM-based rankings more closely with GPT-4 and human judgments, reducing the bias toward length for its own sake.
- Calibration is effective for both standard classifier-based RMs and DPO-based implicit rewards.
- The method is extremely lightweight (seconds on CPU) compared to methods requiring retraining or data generation.
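The takeaways above can be illustrated with a hypothetical sketch (not the paper's actual algorithm): one simple post-hoc correction for length bias is to fit a least-squares regression of reward score on response length and subtract the length-explained component. An O(n) pass like this over 300k samples runs in well under a second on CPU, consistent with the "seconds on CPU" claim; the `length_debias` helper and the toy data are assumptions for illustration.

```python
def length_debias(scores, lengths):
    """Remove the linear length component from reward scores (least squares)."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_l = sum(lengths) / n
    cov = sum((l - mean_l) * (s - mean_s) for s, l in zip(scores, lengths))
    var = sum((l - mean_l) ** 2 for l in lengths)
    slope = cov / var if var else 0.0
    # Calibrated score: residual after regressing out length, re-centered
    # so the mean score is preserved.
    return [s - slope * (l - mean_l) for s, l in zip(scores, lengths)]

# Toy example: raw scores dominated by a length trend.
lengths = [100, 200, 300, 400]
scores  = [1.0, 2.1, 2.9, 4.0]
calibrated = length_debias(scores, lengths)
# The raw scores span 3.0 points; after calibration the spread collapses,
# leaving only the variation not explained by length.
print(max(calibrated) - min(calibrated))
```

The point of the sketch is the shape of the operation, a single cheap statistical fit over existing scores rather than any retraining or data generation, which is what makes post-hoc calibration so lightweight.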