Evaluation Setup
Benchmarks:
- Synthetic RLHF (Reinforcement Learning from Human Feedback) simulation (Gao et al., 2023)
- Real-data RLHF (Wang et al., 2024)
Metrics:
- Attack Success Rate (finding OOD high-reward samples)
- Pearson Correlation (Uncertainty vs. Ground Truth Quality)
- RLHF Training Steps (before reward hacking occurs)
- Statistical methodology: Not explicitly reported in the paper
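The uncertainty-vs-quality metric above can be sketched in a few lines. This is an illustrative setup, not the paper's code: each sample is scored by an ensemble of reward models, uncertainty is taken as the across-ensemble standard deviation, and it is correlated with a hypothetical ground-truth quality score.

```python
# Illustrative sketch of the Pearson (uncertainty vs. quality) metric.
# All data and variable names here are hypothetical, not from the paper.
from statistics import mean, stdev

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs)
             * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom

# Hypothetical ensemble scores: rows = samples, columns = ensemble members.
ensemble_scores = [
    [0.90, 0.80, 0.85],  # in-distribution sample: members agree
    [0.70, 0.72, 0.69],  # in-distribution sample: members agree
    [0.95, 0.20, 0.50],  # adversarial/OOD sample: members disagree
]
quality = [0.9, 0.7, 0.1]  # hypothetical ground-truth quality labels

# Uncertainty = across-ensemble standard deviation per sample.
uncertainty = [stdev(row) for row in ensemble_scores]
r = pearson(uncertainty, quality)  # strongly negative for this toy data
```

A strongly negative `r`, as in the paper's -0.70 result, means high ensemble disagreement tends to accompany low true quality.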
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Synthetic Analysis | Pearson Correlation (Uncertainty vs. Quality) | -0.05 | -0.70 | -0.65 |
Main Takeaways
- Adv-RM effectively exposes vulnerabilities in state-of-the-art reward models, achieving >80% attack success rates on models as large as 340B parameters.
- Incorporating adversarial samples into training significantly extends the stability of RLHF, allowing for 3x more training steps before reward hacking degrades performance.
- Ensemble uncertainty reliably flags OOD samples when ensemble disagreement is large (as with adversarial samples), but it is only weakly correlated with quality on in-distribution data.
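The last takeaway suggests a common mitigation: penalize the reward by ensemble disagreement and flag high-disagreement samples as OOD. The sketch below assumes a standard uncertainty-penalized reward of the form `mean - beta * std`; the `BETA` coefficient and threshold are illustrative choices, not values from the paper.

```python
# Minimal sketch: ensemble disagreement as an OOD guard during RLHF.
# BETA and OOD_THRESHOLD are hypothetical, not from the paper.
from statistics import mean, stdev

BETA = 1.0           # weight on the uncertainty penalty (assumed)
OOD_THRESHOLD = 0.2  # cutoff on ensemble std for flagging OOD (assumed)

def penalized_reward(scores):
    """Uncertainty-penalized reward: mean score minus scaled disagreement."""
    return mean(scores) - BETA * stdev(scores)

def is_ood(scores):
    """Flag a sample as OOD when the ensemble disagrees strongly."""
    return stdev(scores) > OOD_THRESHOLD

in_dist = [0.82, 0.80, 0.84]      # members agree -> low uncertainty
adversarial = [0.95, 0.30, 0.60]  # members disagree -> flagged as OOD
```

Under this scheme an adversarial sample with a high mean score can still receive a low penalized reward, which is the mechanism by which adversarial training data extends RLHF stability.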