
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs

Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, Marjan Ghazvininejad
Fundamental AI Research at Meta, Massachusetts Institute of Technology
arXiv (2025)
RL Benchmark Reasoning

📝 Paper Summary

Reward Modeling AI Alignment Model Robustness
reWordBench reveals that state-of-the-art reward models are brittle to meaning-preserving input transformations, but a simple regularization objective enforcing score consistency on paraphrases significantly improves robustness and downstream alignment quality.
Core Problem
Reward Models (RMs) often overfit to spurious training artifacts, causing them to assign drastically different scores to semantically equivalent inputs (e.g., paraphrases, format changes), which leads to reward hacking and poor alignment.
Why it matters:
  • RMs are the compass for aligning LLMs; if they are brittle, policy models will exploit these flaws (reward hacking) rather than learning intended behaviors
  • Current benchmarks like RewardBench may overestimate RM capability due to overfitting, masking the models' inability to generalize to diverse, realistic user inputs (typos, different formats)
  • Spurious correlations in RMs can degrade the safety and utility of aligned models in deployment
Concrete Example: In a math problem, simply changing the answer format from a standard LaTeX box `\boxed{76^\circ}` to a markdown header `# Answer 76^\circ` causes a state-of-the-art RM's ranking accuracy to drop from >95% to 73%.
Key Novelty
Paraphrase-Consistency Regularization for RMs
  • Constructs reWordBench, a benchmark of 28 transformations (controlled, naturalistic, domain-specific) to systematically stress-test RM consistency
  • Proposes a regularization term during RM training that forces the model to assign similar scores to an original input and its automatically generated paraphrase
  • Demonstrates that robustness to paraphrasing generalizes to other distinct transformations (e.g., code minification, typos) and improves downstream best-of-n alignment
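The consistency idea can be sketched as a small training objective: a standard pairwise ranking loss plus a penalty that pushes the RM's scores for an input and its paraphrase together. This is a minimal illustration, not the paper's exact formulation; the function and argument names (`rm_loss_with_consistency`, `lam`) are hypothetical, and a Bradley-Terry ranking loss with a squared-difference penalty is assumed.

```python
import torch
import torch.nn.functional as F

def rm_loss_with_consistency(r_chosen, r_rejected, r_orig, r_para, lam=1.0):
    """Pairwise RM loss plus a paraphrase-consistency penalty (sketch).

    r_chosen / r_rejected: RM scores for preferred / dispreferred responses.
    r_orig / r_para: RM scores for an input and its automatic paraphrase;
    the penalty pulls these two scores toward each other.
    `lam` (hypothetical name) weights the consistency term.
    """
    # Bradley-Terry ranking loss: prefer r_chosen > r_rejected
    ranking = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Squared-difference penalty enforcing score consistency on paraphrases
    consistency = (r_orig - r_para).pow(2).mean()
    return ranking + lam * consistency
```

With identical scores on original and paraphrase the penalty vanishes and only the ranking term remains; any score gap on paraphrases adds quadratically to the loss.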
Evaluation Highlights
  • State-of-the-art RMs suffer massive degradation on reWordBench; e.g., on the Reasoning subset, standard training drops 20.7% in accuracy under paraphrase transformations
  • The proposed regularized RM roughly halves accuracy degradation (from 16.6% to 8.7%) on the RewardBench Chat Hard subset compared to standard training
  • In downstream alignment (Best-of-64), the regularized RM produces outputs that win up to 59% of the time against a standard-trained RM according to GPT-4o judges
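The Best-of-64 setup above is straightforward to sketch: the policy samples n candidate responses and the reward model reranks them, keeping the highest-scoring one. The `generate` and `score` callables below are hypothetical stand-ins for the policy model and reward model.

```python
def best_of_n(prompt, generate, score, n=64):
    """Best-of-n reranking (sketch): sample n candidates from the policy
    and return the one the reward model scores highest.

    generate(prompt) -> candidate response (stand-in for the policy model)
    score(prompt, candidate) -> scalar reward (stand-in for the RM)
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```

A brittle RM makes this procedure fragile: if scores swing on meaning-preserving rewordings, the argmax can pick a candidate for spurious surface features rather than quality, which is why RM robustness shows up directly in best-of-n win rates.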
Breakthrough Assessment
8/10
Convincingly exposes the fragility of current SOTA reward models and provides a simple, effective fix that generalizes well. A strong 'wake-up call' paper for the alignment community.