
Adversarial Training of Reward Models

Alexander Bukharin, Haifeng Qian, Shengyang Sun, Adithya Renduchintala, Soumye Singhal, Zhilin Wang, Oleksii Kuchaiev, Olivier Delalleau, Tuo Zhao
NVIDIA, Georgia Institute of Technology
arXiv (2025)
RL Benchmark

📝 Paper Summary

AI Alignment · Reward Modeling · Adversarial Training
Adv-RM improves reward model robustness by training an adversarial policy to generate out-of-distribution, low-quality responses that receive high rewards, then using these examples to harden the model against reward hacking.
Core Problem
Contemporary reward models (RMs) lack robustness, often assigning high scores to low-quality, out-of-distribution responses, which leads to reward hacking during Reinforcement Learning from Human Feedback (RLHF).
Why it matters:
  • Reward hacking reduces actual alignment with human values as policies exploit unintended shortcuts in the reward model rather than generating high-quality text
  • RMs are used as proxies for human feedback in critical pipelines (data selection, RLHF, moderation), so their failure compromises the entire model lifecycle
  • Existing regularization methods like uncertainty estimation are unreliable for in-distribution responses and fail to capture the full diversity of possible model failures
Concrete Example: A standard reward model might assign a high score to a response containing random text or lacking punctuation, simply because such responses fall outside its training distribution (OOD). Adv-RM automatically discovers vulnerabilities of this kind, such as "responses that have no punctuation," and surfaces them for retraining.
Key Novelty
Adversarial Reward Modeling (Adv-RM)
  • Trains an adversarial policy using reinforcement learning to generate responses that maximize the target reward model's score while simultaneously maximizing the disagreement (uncertainty) among an ensemble of reward models
  • Uses these generated 'high-reward, high-uncertainty' samples as negative examples (rejected pairs) in an iterative training loop to robustify the reward model against OOD inputs
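The adversarial objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of ensemble standard deviation as the disagreement measure, and the weighting coefficient `lam` are assumptions made for clarity.

```python
import numpy as np

def adversarial_reward(target_rm_score: float,
                       ensemble_scores: np.ndarray,
                       lam: float = 1.0) -> float:
    """Hypothetical reward for the adversarial policy.

    Combines the target RM's score with the disagreement (std. dev.)
    across an RM ensemble, steering the policy toward responses that
    the target RM rates highly but the ensemble is uncertain about.
    """
    disagreement = float(np.std(ensemble_scores))
    return target_rm_score + lam * disagreement

def build_preference_pair(prompt: str, chosen: str, adversarial: str) -> dict:
    """Adversarial samples become 'rejected' examples for RM retraining."""
    return {"prompt": prompt, "chosen": chosen, "rejected": adversarial}
```

In this sketch, a high-reward, high-disagreement response earns the adversarial policy a large reward, and the resulting sample is then paired as the rejected side of a preference example in the next round of reward-model training.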
Evaluation Highlights
  • Achieves >80% attack success rate in finding adversarial examples for SOTA reward models like Nemotron-4-340B-Reward
  • Enables downstream RLHF training to proceed for 3x as many steps without exhibiting reward hacking compared to conventional reward models
  • Demonstrates a strong negative correlation (-0.70 Pearson) between ensemble uncertainty and ground-truth quality on adversarial samples, validating the detection strategy
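The reported negative correlation between ensemble disagreement and ground-truth quality can be checked with a standard Pearson computation. The snippet below uses toy data (not the paper's measurements) purely to show the calculation:

```python
import numpy as np

# Toy data (hypothetical): ensemble disagreement tends to be high
# exactly where ground-truth response quality is low.
uncertainty = np.array([0.1, 0.3, 0.5, 0.8, 0.9])
quality     = np.array([0.9, 0.8, 0.5, 0.3, 0.1])

# Pearson correlation between the two signals.
r = np.corrcoef(uncertainty, quality)[0, 1]
print(f"Pearson r = {r:.2f}")  # strongly negative on this toy data
```

A strongly negative `r` on real adversarial samples is what validates ensemble disagreement as a proxy for detecting low-quality, high-reward responses.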
Breakthrough Assessment
8/10
Addresses a critical bottleneck in RLHF (reward hacking) with a novel automated red-teaming approach. The ability to attack and robustify SOTA 340B-parameter RMs without human-in-the-loop is highly significant.