Reward Overoptimization: A phenomenon where optimizing a policy against a proxy reward model eventually leads to a decrease in the true reward (ground truth performance) as the policy exploits the proxy's flaws.
Best-of-N (BoN): An inference-time method where N solutions are generated, and the one with the highest reward model score is selected.
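The selection step can be sketched in a few lines. This is a minimal illustration, not any specific system's implementation; `toy_generate` and `toy_rm` are hypothetical stand-ins for an LLM sampler and a trained reward model:

```python
from itertools import cycle

def best_of_n(generate, reward_model, prompt, n=4):
    # Sample n candidate solutions, score each with the (proxy) reward
    # model, and return the highest-scoring candidate.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

# Toy stand-ins for illustration only (hypothetical): a real system
# would sample from an LLM and score with a learned reward model.
_pool = cycle(["2+2=5", "2+2=4", "2+2=3"])

def toy_generate(prompt):
    return next(_pool)

def toy_rm(prompt, response):
    return 1.0 if response.endswith("4") else 0.0

best_of_n(toy_generate, toy_rm, "What is 2+2?", n=3)  # → "2+2=4"
```

Note that BoN optimizes against the proxy reward model at inference time, so with large N it is subject to the same overoptimization risk described above.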
Reward Hacking: When a policy generates outputs that score highly under the reward model but are poor or incorrect by human judgment.
Process Reward Model (PRM): A reward model that scores each intermediate step of a reasoning chain rather than just the final answer.
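Because a PRM scores steps rather than whole solutions, the per-step scores must be aggregated to rank complete solutions. A minimal sketch, assuming minimum-over-steps as the aggregation rule (one common choice; product is another):

```python
def prm_solution_score(step_scores):
    # step_scores: a PRM's score for each intermediate reasoning step.
    # Taking the minimum penalizes a chain for its weakest step, so one
    # bad step sinks the whole solution.
    return min(step_scores)

prm_solution_score([0.9, 0.8, 0.3, 0.95])  # → 0.3
```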
Mean Reciprocal Rank (MRR): A ranking metric used here to evaluate how highly the correct solution is ranked among incorrect ones; the reciprocal rank for a single problem is 1/rank of the correct solution, and MRR is the mean of these values across problems.
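The metric is straightforward to compute; a short sketch:

```python
def mean_reciprocal_rank(ranks):
    # ranks: 1-based rank of the correct solution on each problem.
    # The reciprocal rank of one problem is 1/rank; MRR averages it
    # over all problems.
    return sum(1.0 / r for r in ranks) / len(ranks)

# Correct solution ranked 1st, 2nd, and 4th on three problems:
mean_reciprocal_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3 ≈ 0.583
```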
Generative Reward Model: Using an LLM to evaluate responses, either by direct scoring or by pairwise ranking prompts (LLM-as-a-judge).
PPO: Proximal Policy Optimization—an RL algorithm used to fine-tune policies using signals from the reward model.
Bradley-Terry (BT) model: A statistical model for estimating the probability that one item is preferred over another in pairwise comparisons.
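Under the BT model, each item has a latent score and the preference probability is a logistic function of the score difference. A minimal sketch:

```python
import math

def bt_preference_prob(score_a, score_b):
    # Bradley-Terry: P(A preferred over B) = sigmoid(s_A - s_B),
    # equivalently exp(s_A) / (exp(s_A) + exp(s_B)).
    return 1.0 / (1.0 + math.exp(score_b - score_a))

bt_preference_prob(2.0, 2.0)  # → 0.5 when scores are equal
```

This is the objective reward models are typically trained under: the model's scalar outputs play the role of the BT scores, fit so that preferred responses get higher scores than rejected ones.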