Tianqi Liu, Wei Xiong, Jie Ren, Lichang Chen, Junru Wu, Rishabh Joshi, Yang Gao, Jiaming Shen, Zhen Qin, Tianhe Yu, Daniel Sohn, A. Makarova, Jeremiah Liu, Yuan Liu, Bilal Piot, Abe Ittycheriah, Aviral Kumar, Mohammad Saleh
Google DeepMind,
University of Illinois Urbana-Champaign,
University of Maryland, College Park
International Conference on Learning Representations (2024)
📝 Paper Summary
Reinforcement Learning from Human Feedback (RLHF) · Reward Modeling
RRM mitigates reward hacking by augmenting training data with counterfactual response pairs to causally disentangle genuine contextual quality signals from spurious artifacts like length or formatting.
Core Problem
Traditional Reward Models (RMs) fail to distinguish between prompt-dependent quality signals and prompt-independent artifacts (e.g., length), leading to 'reward hacking' where models exploit these artifacts.
Why it matters:
LLMs aligned via RLHF often become unnecessarily verbose or overuse formatting because RMs learn to prioritize these easy-to-spot artifacts over actual quality
Current training pairs are always on-topic, preventing the model from seeing 'counterfactuals' where artifacts exist without the correct context
Reward hacking degrades the actual helpfulness and honesty of aligned models despite high reward scores
Concrete Example: If 80% of preferred responses in a dataset are long, a standard RM learns to simply count tokens to predict the winner. Consequently, the aligned policy generates bloated, repetitive paragraphs even for simple yes/no questions to maximize this length-biased reward.
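A hedged illustration of the failure mode above (synthetic data, not from the paper): on a dataset where 80% of preferred responses happen to be the longer ones, a trivial "prefer the longer response" heuristic already scores about 80% accuracy without reading the prompt at all.

```python
import random

random.seed(0)

# Synthetic preference dataset: 80% of the time, the chosen (preferred)
# response is the longer one -- a spurious length artifact.
def make_pair():
    short, long = "Yes.", "Yes, " + "and furthermore " * 10
    if random.random() < 0.8:
        return long, short   # (chosen, rejected): chosen is longer
    return short, long

pairs = [make_pair() for _ in range(10_000)]

# A "reward model" that only counts tokens and ignores the prompt.
length_rm = lambda text: len(text.split())

# Accuracy of the length-only heuristic at predicting the winner.
correct = sum(length_rm(c) > length_rm(r) for c, r in pairs)
print(f"length-only accuracy: {correct / len(pairs):.2%}")
```

Because the artifact predicts the label this well, gradient descent has little incentive to learn the harder contextual signal, which is precisely the correlation RRM's augmentation breaks.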
Key Novelty
Causal Data Augmentation for Reward Modeling
Models the preference problem as a causal graph distinguishing 'Contextual Signals' (depend on prompt) from 'Artifacts' (independent of prompt)
Constructs augmented training pairs by mixing responses from different prompts to break the correlation between artifacts and preference labels
Trains the reward model to prefer contextual responses over non-contextual ones regardless of artifacts, forcing it to learn actual prompt-response relevance
Architecture
The RRM pipeline showing how training data is augmented. It illustrates mixing responses from different examples to create pairs where artifacts are balanced or randomized.
Evaluation Highlights
+3.54-point accuracy gain on RewardBench (80.61% → 84.15%) using Gemma-2-9b-it
+19.03-point gain in length-controlled win-rate on AlpacaEval-2 (33.46% → 52.49%) for a DPO policy trained with RRM
+1.04 score improvement on MT-Bench (7.27 → 8.31) for the downstream aligned policy
Breakthrough Assessment
7/10
Significant improvements in both reward modeling accuracy and downstream policy performance by addressing a fundamental causal flaw in RLHF data construction. The method is simple yet highly effective.
⚙️ Technical Details
Problem Definition
Setting: Pairwise preference modeling where a model predicts P(y1 > y2 | x)
Inputs: Prompt x and a pair of responses (y1, y2)
Outputs: Preference probability (or binary classification of the preferred response)
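Under the standard Bradley-Terry formulation (a prerequisite listed below), the preference probability is the sigmoid of the reward difference. A minimal sketch, with scalar rewards standing in for a learned reward model r(x, y):

```python
import math

def preference_prob(r1: float, r2: float) -> float:
    """Bradley-Terry model: P(y1 > y2 | x) = sigmoid(r(x, y1) - r(x, y2))."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

# Equal rewards give a 50/50 preference; a 2-point reward gap
# gives roughly an 88% preference for the first response.
print(preference_prob(1.0, 1.0))
print(preference_prob(3.0, 1.0))
```

Only the reward *difference* matters, which is why any artifact that inflates rewards for one side (e.g. length) directly skews the predicted preference.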
Pipeline Flow
Data Augmentation (Generate counterfactual pairs)
Pairwise Ranking (Inference/Training)
System Modules
Data Augmentor
Creates new training triplets by mixing responses from different prompts
Model or implementation: Algorithmic permutation
Reward Model
Predicts preference between two responses given a prompt
Model or implementation: Gemma-2-9b-it (Fine-tuned)
Novel Architectural Elements
Causal data augmentation pipeline that explicitly pairs contextual responses with non-contextual (off-topic) responses to disentangle artifacts from quality
Modeling
Base Model: Gemma-2-9b-it
Training Method: Pairwise Ranking Optimization (Reward Modeling) followed by DPO (Policy Training)
Objective Functions:
Purpose: Maximize likelihood of preferred response.
Formally: Pairwise ranking loss (minimizing negative log-likelihood of correct ranking)
Purpose: Policy alignment.
Formally: DPO (Direct Preference Optimization) loss L_DPO
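The two objectives above, written out in their standard forms (the paper's exact notation may differ slightly); here r_θ is the reward model, π_θ the policy, π_ref the frozen reference policy, and β the DPO temperature:

```latex
% Pairwise ranking (Bradley-Terry negative log-likelihood) for the reward model:
\mathcal{L}_{\mathrm{RM}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \tilde{D}_{\mathrm{hf}}}
    \left[\log \sigma\!\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]

% DPO loss for downstream policy alignment:
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```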
Training Data:
Original dataset D_hf expanded to D_tilde_hf via random permutations
Rules: Contextual > Non-contextual; Non-contextual vs Non-contextual = Tie
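The expansion of D_hf into D_tilde_hf can be sketched as follows. This is an illustrative reading of the rules above, not the paper's exact algorithm: all names (`augment`, the label encoding) are hypothetical, and the real method samples permutations rather than enumerating all of them.

```python
import itertools

# Original preference data D_hf: (prompt, chosen, rejected) triplets.
D_hf = [
    ("What is 2+2?", "2+2 equals 4.", "It might be 5."),
    ("Name a primary color.", "Red is a primary color.", "Purple."),
]

def augment(dataset):
    """Expand D_hf into D_tilde_hf with cross-prompt (non-contextual) pairs.

    Rules (from the summary above):
      - contextual response > non-contextual response
      - non-contextual vs non-contextual -> tie
    Labels: 1.0 means the first response wins, 0.5 means tie.
    """
    augmented = [(x, c, r, 1.0) for x, c, r in dataset]
    for (x1, c1, r1), (x2, c2, r2) in itertools.permutations(dataset, 2):
        # A response written for a *different* prompt is an off-topic
        # counterfactual: its artifacts (length, formatting) survive,
        # but its contextual signal does not.
        augmented.append((x1, c1, c2, 1.0))  # contextual beats non-contextual
        augmented.append((x1, r1, c2, 1.0))  # even the rejected on-topic reply wins
        augmented.append((x1, c2, r2, 0.5))  # two off-topic replies tie
    return augmented

D_tilde_hf = augment(D_hf)
print(len(D_hf), "->", len(D_tilde_hf))
```

Because winners and losers now share artifacts but differ in contextual relevance, length can no longer serve as a shortcut for predicting the label.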
Key Hyperparameters: Not explicitly reported in the paper
Compute: Not reported in the paper
Comparison to Prior Work
vs. ODIN: RRM uses data augmentation to causally separate artifacts rather than architectural modifications
vs. Length-controlled Alpaca: RRM is a general causal framework handling unobservable artifacts, not just explicitly modeled length
vs. Standard RM: RRM trains on counterfactual (off-topic) pairs to learn robustness (implicit baseline rather than a cited comparison)
Limitations
Assumes that context-free artifacts (A) and contextual signals (S) are sufficiently distinguishable via permutation
Computationally increases dataset size due to augmentation
Reproducibility
The paper provides the causal logic and augmentation rules. Code URL is not provided. Base model (Gemma-2-9b-it) is public. Specific training hyperparameters (LR, batch size) are not detailed in the provided text.
📊 Experiments & Results
Evaluation Setup
Reward model accuracy evaluation and downstream policy alignment evaluation
Benchmarks:
RewardBench (Reward Model Evaluation)
MT-Bench (Multi-turn conversation quality)
AlpacaEval-2 (length-controlled instruction following)
Metrics:
Accuracy (RewardBench)
MT-Bench Score
Length-controlled Win-rate (AlpacaEval-2)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark     | Metric                     | Baseline | This Paper | Δ
RewardBench   | Accuracy                   | 80.61    | 84.15      | +3.54
MT-Bench      | Score                      | 7.27     | 8.31       | +1.04
AlpacaEval-2  | Length-controlled Win-rate | 33.46    | 52.49      | +19.03
Main Takeaways
Policies trained on Robust Reward Models (RRM) consistently outperform those based on baseline RMs, especially on length-controlled metrics.
The approach effectively filters out undesirable artifacts like verbosity without needing explicit penalties or architectural changes.
Causal data augmentation is a viable strategy for improving RM generalization and robustness.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning from Human Feedback (RLHF)
Bradley-Terry Model
Causal Inference (DAGs, d-separation)
Key Terms
Reward Hacking: When a model exploits flaws in the reward function (e.g., producing long but empty text) to maximize score without achieving the intended goal
DPO: Direct Preference Optimization—an algorithm for aligning language models to preferences without explicitly training a separate reward model during the policy update phase
RLHF: Reinforcement Learning from Human Feedback—a method to align LLMs using a reward model trained on human preferences
DAG: Directed Acyclic Graph—a graphical representation of causal relationships between variables
Contextual Signal: The genuine quality aspect of a response that depends on how well it answers the specific prompt
Artifact: Features of a response (like length or markdown) that are independent of the prompt but often spuriously correlated with human preference
Sufficient Statistic: A statistic that captures all the information in the data relevant to the parameter being estimated