Tianqi Liu, Wei Xiong, Jie Ren, Lichang Chen, Junru Wu, Rishabh Joshi, Yang Gao, Jiaming Shen, Zhen Qin, Tianhe Yu, Daniel Sohn, A. Makarova, Jeremiah Liu, Yuan Liu, Bilal Piot, Abe Ittycheriah, Aviral Kumar, Mohammad Saleh
Google DeepMind,
University of Illinois Urbana-Champaign,
University of Maryland, College Park
International Conference on Learning Representations (2024)
📝 Paper Summary
Reinforcement Learning from Human Feedback (RLHF) · Reward Modeling
RRM mitigates reward hacking by augmenting training data with counterfactual response pairs to causally disentangle genuine contextual quality signals from spurious artifacts like length or formatting.
Core Problem
Traditional Reward Models (RMs) fail to distinguish between prompt-dependent quality signals and prompt-independent artifacts (e.g., length), leading to 'reward hacking' where models exploit these artifacts.
Why it matters:
LLMs aligned via RLHF often become unnecessarily verbose or overuse formatting because RMs learn to prioritize these easy-to-spot artifacts over actual quality
Current training pairs are always on-topic, preventing the model from seeing 'counterfactuals' where artifacts exist without the correct context
Reward hacking degrades the actual helpfulness and honesty of aligned models despite high reward scores
Concrete Example: If 80% of preferred responses in a dataset are long, a standard RM learns to simply count tokens to predict the winner. Consequently, the aligned policy generates bloated, repetitive paragraphs even for simple yes/no questions to maximize this length-biased reward.
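A hedged illustration of the failure mode above (synthetic data, not from the paper): on a dataset where 80% of preferred responses happen to be the longer ones, a trivial "prefer the longer response" heuristic already scores about 80% accuracy without reading the prompt at all.

```python
import random

random.seed(0)

# Synthetic preference dataset: 80% of the time, the chosen (preferred)
# response is the longer one -- a spurious length artifact.
def make_pair():
    short, long = "Yes.", "Yes, " + "and furthermore " * 10
    if random.random() < 0.8:
        return long, short   # (chosen, rejected): chosen is longer
    return short, long

pairs = [make_pair() for _ in range(10_000)]

# A "reward model" that only counts tokens and ignores the prompt.
length_rm = lambda text: len(text.split())

# Accuracy of the length-only heuristic at predicting the winner.
correct = sum(length_rm(c) > length_rm(r) for c, r in pairs)
print(f"length-only accuracy: {correct / len(pairs):.2%}")
```

Because the artifact predicts the label this well, gradient descent has little incentive to learn the harder contextual signal, which is precisely the correlation RRM's augmentation breaks.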
Key Novelty
Causal Data Augmentation for Reward Modeling
Models the preference problem as a causal graph distinguishing 'Contextual Signals' (depend on prompt) from 'Artifacts' (independent of prompt)
Constructs augmented training pairs by mixing responses from different prompts to break the correlation between artifacts and preference labels
Trains the reward model to prefer contextual responses over non-contextual ones regardless of artifacts, forcing it to learn actual prompt-response relevance
Architecture
The RRM pipeline showing how training data is augmented. It illustrates mixing responses from different examples to create pairs where artifacts are balanced or randomized.
Evaluation Highlights
+3.54-point accuracy gain on RewardBench (80.61% → 84.15%) using Gemma-2-9b-it
+19.03-point gain in length-controlled win-rate on AlpacaEval-2 (33.46% → 52.49%) for a DPO policy trained with RRM
+1.04 score improvement on MT-Bench (7.27 → 8.31) for the downstream aligned policy
Breakthrough Assessment
7/10
Significant improvements in both reward modeling accuracy and downstream policy performance by addressing a fundamental causal flaw in RLHF data construction. The method is simple yet highly effective.
⚙️ Technical Details
Problem Definition
Setting: Pairwise preference modeling where a model predicts P(y1 > y2 | x)
Inputs: Prompt x and a pair of responses (y1, y2)
Outputs: Preference probability (or binary classification of the preferred response)
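Under the standard Bradley-Terry formulation (a prerequisite listed below), the preference probability is the sigmoid of the reward difference. A minimal sketch, with scalar rewards standing in for a learned reward model r(x, y):

```python
import math

def preference_prob(r1: float, r2: float) -> float:
    """Bradley-Terry model: P(y1 > y2 | x) = sigmoid(r(x, y1) - r(x, y2))."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

# Equal rewards give a 50/50 preference; a 2-point reward gap
# gives roughly an 88% preference for the first response.
print(preference_prob(1.0, 1.0))
print(preference_prob(3.0, 1.0))
```

Only the reward *difference* matters, which is why any artifact that inflates rewards for one side (e.g. length) directly skews the predicted preference.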
Pipeline Flow
Data Augmentation (Generate counterfactual pairs)
Pairwise Ranking (Inference/Training)
System Modules
Data Augmentor
Creates new training triplets by mixing responses from different prompts
Model or implementation: Algorithmic permutation
Reward Model
Predicts preference between two responses given a prompt
Model or implementation: Gemma-2-9b-it (Fine-tuned)
Novel Architectural Elements
Causal data augmentation pipeline that explicitly pairs contextual responses with non-contextual (off-topic) responses to disentangle artifacts from quality
Modeling
Base Model: Gemma-2-9b-it
Training Method: Pairwise Ranking Optimization (Reward Modeling) followed by DPO (Policy Training)
Objective Functions:
Purpose: Maximize likelihood of preferred response.
Formally: Pairwise ranking loss (minimizing negative log-likelihood of correct ranking)
Purpose: Policy alignment.
Formally: DPO (Direct Preference Optimization) loss L_DPO
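The two objectives above, written out in their standard forms (the paper's exact notation may differ slightly); here r_θ is the reward model, π_θ the policy, π_ref the frozen reference policy, and β the DPO temperature:

```latex
% Pairwise ranking (Bradley-Terry negative log-likelihood) for the reward model:
\mathcal{L}_{\mathrm{RM}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \tilde{D}_{\mathrm{hf}}}
    \left[\log \sigma\!\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]

% DPO loss for downstream policy alignment:
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```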
Training Data:
Original dataset D_hf expanded to D_tilde_hf via random permutations
Rules: Contextual > Non-contextual; Non-contextual vs Non-contextual = Tie
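The expansion of D_hf into D_tilde_hf can be sketched as follows. This is an illustrative reading of the rules above, not the paper's exact algorithm: all names (`augment`, the label encoding) are hypothetical, and the real method samples permutations rather than enumerating all of them.

```python
import itertools

# Original preference data D_hf: (prompt, chosen, rejected) triplets.
D_hf = [
    ("What is 2+2?", "2+2 equals 4.", "It might be 5."),
    ("Name a primary color.", "Red is a primary color.", "Purple."),
]

def augment(dataset):
    """Expand D_hf into D_tilde_hf with cross-prompt (non-contextual) pairs.

    Rules (from the summary above):
      - contextual response > non-contextual response
      - non-contextual vs non-contextual -> tie
    Labels: 1.0 means the first response wins, 0.5 means tie.
    """
    augmented = [(x, c, r, 1.0) for x, c, r in dataset]
    for (x1, c1, r1), (x2, c2, r2) in itertools.permutations(dataset, 2):
        # A response written for a *different* prompt is an off-topic
        # counterfactual: its artifacts (length, formatting) survive,
        # but its contextual signal does not.
        augmented.append((x1, c1, c2, 1.0))  # contextual beats non-contextual
        augmented.append((x1, r1, c2, 1.0))  # even the rejected on-topic reply wins
        augmented.append((x1, c2, r2, 0.5))  # two off-topic replies tie
    return augmented

D_tilde_hf = augment(D_hf)
print(len(D_hf), "->", len(D_tilde_hf))
```

Because winners and losers now share artifacts but differ in contextual relevance, length can no longer serve as a shortcut for predicting the label.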
Key Hyperparameters: Not explicitly reported in the paper
Compute: Not reported in the paper
Comparison to Prior Work
vs. ODIN: RRM uses data augmentation to causally separate artifacts rather than architectural modifications
vs. Length-controlled Alpaca: RRM is a general causal framework handling unobservable artifacts, not just explicitly modeled length
vs. Standard RM: RRM trains on counterfactual (off-topic) pairs to learn robustness (implicit baseline rather than a cited comparison)
Limitations
Assumes that context-free artifacts (A) and contextual signals (S) are sufficiently distinguishable via permutation
Computationally increases dataset size due to augmentation
Reproducibility
The paper provides the causal logic and augmentation rules. Code URL is not provided. Base model (Gemma-2-9b-it) is public. Specific training hyperparameters (LR, batch size) are not detailed in the provided text.
📊 Experiments & Results
Evaluation Setup
Reward model accuracy evaluation and downstream policy alignment evaluation
Benchmarks:
RewardBench (Reward Model Evaluation)
MT-Bench (Multi-turn conversation quality)
AlpacaEval-2 (length-controlled instruction following)
Metrics:
Accuracy (RewardBench)
MT-Bench Score
Length-controlled Win-rate (AlpacaEval-2)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark     | Metric                     | Baseline | This Paper | Δ
RewardBench   | Accuracy                   | 80.61    | 84.15      | +3.54
MT-Bench      | Score                      | 7.27     | 8.31       | +1.04
AlpacaEval-2  | Length-controlled Win-rate | 33.46    | 52.49      | +19.03
Main Takeaways
Policies trained on Robust Reward Models (RRM) consistently outperform those based on baseline RMs, especially on length-controlled metrics.
The approach effectively filters out undesirable artifacts like verbosity without needing explicit penalties or architectural changes.
Causal data augmentation is a viable strategy for improving RM generalization and robustness.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning from Human Feedback (RLHF)
Bradley-Terry Model
Causal Inference (DAGs, d-separation)
Key Terms
Reward Hacking: When a model exploits flaws in the reward function (e.g., producing long but empty text) to maximize score without achieving the intended goal
DPO: Direct Preference Optimization—an algorithm for aligning language models to preferences without explicitly training a separate reward model during the policy update phase
RLHF: Reinforcement Learning from Human Feedback—a method to align LLMs using a reward model trained on human preferences
DAG: Directed Acyclic Graph—a graphical representation of causal relationships between variables
Contextual Signal: The genuine quality aspect of a response that depends on how well it answers the specific prompt
Artifact: Features of a response (like length or markdown) that are independent of the prompt but often spuriously correlated with human preference
Sufficient Statistic: A statistic that captures all the information in the data relevant to the parameter being estimated