
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs

Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, Marjan Ghazvininejad
Fundamental AI Research at Meta, Massachusetts Institute of Technology
arXiv (2025)
RL Benchmark Reasoning

📝 Paper Summary

Reward Modeling AI Alignment Model Robustness
reWordBench reveals that state-of-the-art reward models are brittle to meaning-preserving input transformations, but a simple regularization objective enforcing score consistency on paraphrases significantly improves robustness and downstream alignment quality.
Core Problem
Reward Models (RMs) often overfit to spurious training artifacts, causing them to assign drastically different scores to semantically equivalent inputs (e.g., paraphrases, format changes), which leads to reward hacking and poor alignment.
Why it matters:
  • RMs are the compass for aligning LLMs; if they are brittle, policy models will exploit these flaws (reward hacking) rather than learning intended behaviors
  • Current benchmarks like RewardBench may overestimate RM capability due to overfitting, masking the models' inability to generalize to diverse, realistic user inputs (typos, different formats)
  • Spurious correlations in RMs can degrade the safety and utility of aligned models in deployment
Concrete Example: In a math problem, simply changing the answer format from a standard LaTeX box `\boxed{76^\circ}` to a markdown header `# Answer 76^\circ` causes a state-of-the-art RM's ranking accuracy to drop from >95% to 73%.
Key Novelty
Paraphrase-Consistency Regularization for RMs
  • Constructs reWordBench, a benchmark of 28 transformations (controlled, naturalistic, domain-specific) to systematically stress-test RM consistency
  • Proposes a regularization term during RM training that forces the model to assign similar scores to an original input and its automatically generated paraphrase
  • Demonstrates that robustness to paraphrasing generalizes to other distinct transformations (e.g., code minification, typos) and improves downstream best-of-n alignment
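The consistency idea can be sketched as a small training objective: a standard pairwise ranking loss plus a penalty that pushes the RM's scores for an input and its paraphrase together. This is a minimal illustration, not the paper's exact formulation; the function and argument names (`rm_loss_with_consistency`, `lam`) are hypothetical, and a Bradley-Terry ranking loss with a squared-difference penalty is assumed.

```python
import torch
import torch.nn.functional as F

def rm_loss_with_consistency(r_chosen, r_rejected, r_orig, r_para, lam=1.0):
    """Pairwise RM loss plus a paraphrase-consistency penalty (sketch).

    r_chosen / r_rejected: RM scores for preferred / dispreferred responses.
    r_orig / r_para: RM scores for an input and its automatic paraphrase;
    the penalty pulls these two scores toward each other.
    `lam` (hypothetical name) weights the consistency term.
    """
    # Bradley-Terry ranking loss: prefer r_chosen > r_rejected
    ranking = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Squared-difference penalty enforcing score consistency on paraphrases
    consistency = (r_orig - r_para).pow(2).mean()
    return ranking + lam * consistency
```

With identical scores on original and paraphrase the penalty vanishes and only the ranking term remains; any score gap on paraphrases adds quadratically to the loss.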
Evaluation Highlights
  • State-of-the-art RMs suffer massive degradation on reWordBench; e.g., on the Reasoning subset, standard training drops 20.7% in accuracy under paraphrase transformations
  • The proposed regularized RM roughly halves accuracy degradation (from 16.6% to 8.7%) on the RewardBench Chat Hard subset compared to standard training
  • In downstream alignment (Best-of-64), the regularized RM produces outputs that win up to 59% of the time against a standard-trained RM according to GPT-4o judges
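The Best-of-64 setup above is straightforward to sketch: the policy samples n candidate responses and the reward model reranks them, keeping the highest-scoring one. The `generate` and `score` callables below are hypothetical stand-ins for the policy model and reward model.

```python
def best_of_n(prompt, generate, score, n=64):
    """Best-of-n reranking (sketch): sample n candidates from the policy
    and return the one the reward model scores highest.

    generate(prompt) -> candidate response (stand-in for the policy model)
    score(prompt, candidate) -> scalar reward (stand-in for the RM)
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```

A brittle RM makes this procedure fragile: if scores swing on meaning-preserving rewordings, the argmax can pick a candidate for spurious surface features rather than quality, which is why RM robustness shows up directly in best-of-n win rates.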
Breakthrough Assessment
8/10
Convincingly exposes the fragility of current SOTA reward models and provides a simple, effective fix that generalizes well. A strong 'wake-up call' paper for the alignment community.