Reward Reasoning Model - Paper Summary

📝 Paper Summary

Reward Modeling, Reinforcement Learning from Human Feedback (RLHF), Test-time Compute Scaling

Reward Reasoning Models (RRMs) improve preference labeling by performing explicit chain-of-thought reasoning before generating a reward, trained via reinforcement learning without needing human-annotated reasoning traces.

Core Problem

Existing reward models typically use uniform computational resources for all queries, failing to adapt to complex tasks that require extensive reasoning to evaluate correctly.

Why it matters:

Standard scalar reward models struggle with complex math or coding queries where the correctness of a response isn't immediately obvious without step-by-step verification
Current approaches cannot scale test-time compute; they expend the same effort on trivial questions as they do on hard reasoning problems
Verifiable reward learning (RLVR) is limited to domains with ground-truth answers (like math), whereas general-purpose reward models are needed for open-ended domains

Concrete Example: In a coding task where one assistant provides a flawed solution using bitwise operations and another provides a correct solution using loops, a standard reward model might superficially prefer the bitwise one due to length or style. An RRM explicitly reasons ('Wait, the bitwise approach doesn't apply to powers of three... let me test it') to discover the subtle error before assigning the reward.

Key Novelty

Reward Reasoning via Reinforcement Learning

Formulate reward modeling as a reasoning task where the model generates a thought process (Chain-of-Thought) before outputting a preference judgment
Train the reward model using Reinforcement Learning (GRPO) without annotated reasoning traces, optimizing only for the correctness of the final preference label against a ground truth
Enable test-time scaling for evaluation by allowing the model to 'think' longer or use tournament-style voting for multi-candidate ranking

Architecture

Comparison of Scalar Reward Models, Generative Reward Models, and Reward Reasoning Models (RRMs) input/output flows.

Evaluation Highlights

RRM-32B achieves 98.6% accuracy on RewardBench Reasoning subset, outperforming GPT-4o (88.1%) and Claude 3.5 Sonnet (84.7%)
On MATH best-of-N selection, RRM-32B (voting@5) achieves 91.8% accuracy, surpassing GPT-4o-0806 (56.9%) by 34.9 points
Reinforcement learning with RRM-generated labels improves downstream GPQA performance from ~30% to over 40% using only unlabeled queries

Breakthrough Assessment

9/10

Significantly advances reward modeling by successfully applying the 'reasoning' paradigm (o1/R1 style) to evaluators. The ability to scale test-time compute for rewards is a major shift from scalar models.

⚙️ Technical Details

Problem Definition

Setting: Pairwise preference ranking with explicit reasoning generation

Inputs: A query Q and two candidate responses A and B

Outputs: A reasoning trace (Chain-of-Thought) followed by a final preference decision (e.g., 'Assistant 1')

Pipeline Flow

Input: Query + Response A + Response B
Reasoning Phase: Autoregressive generation of thought process
Judgment Phase: Generation of final verdict (Assistant 1/2)
Optional: Multi-response aggregation via ELO or Knockout Tournament
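
The optional aggregation step can be sketched as a single-elimination bracket built on top of a pairwise judge. This is a minimal illustration, not the paper's implementation: the `judge` callable is a hypothetical stand-in for an RRM call that returns 1 or 2, and the random pairing, the bye rule for odd pools, and `toy_judge` are assumptions made for the example.

```python
import random
from typing import Callable, List

# Hypothetical judge signature: (query, response_a, response_b) -> 1 or 2.
# In practice this would wrap an RRM call and parse its final verdict.
Judge = Callable[[str, str, str], int]


def knockout(query: str, candidates: List[str], judge: Judge, seed: int = 0) -> str:
    """Run a single-elimination bracket and return the surviving candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)  # assumption: candidates are paired at random
    while len(pool) > 1:
        next_round = []
        # Compare adjacent candidates; the winner of each pair advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if judge(query, a, b) == 1 else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # assumption: odd one out gets a bye
        pool = next_round
    return pool[0]


def toy_judge(query: str, a: str, b: str) -> int:
    """Toy stand-in for an RRM: prefers the longer response."""
    return 1 if len(a) >= len(b) else 2


best = knockout("q", ["a", "bbb", "cc", "dddd", "e"], toy_judge)
```

Each comparison eliminates exactly one candidate, so n candidates need n - 1 judge calls in total.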

System Modules

Reward Reasoning Model

Analyze responses via reasoning chain and output preference

Model or implementation: Qwen2-based (DeepSeek-R1-Distill) decoder-only transformer

Novel Architectural Elements

Integration of chain-of-thought generation directly into the reward modeling objective
Application of rule-based reinforcement learning (sparse rewards based on label correctness) to evolve reward reasoning without human-written reasoning traces

Modeling

Base Model: DeepSeek-R1-Distill-Qwen-1.5B, 7B, 14B, and 32B

Training Method: Reinforcement Learning (GRPO)

Objective Functions:

Purpose: Maximize probability of selecting the correct preference label.

Formally: Reward R = +1 if RRM selects correct response, -1 otherwise.
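
The rule-based reward above could be computed roughly as follows. This is a sketch under stated assumptions: the verdict pattern `Assistant 1` / `Assistant 2` is taken from the output example in this summary, while reading the last match as the final decision and scoring an unparseable output as -1 are choices made for the example, not details confirmed by the paper.

```python
import re
from typing import Optional

# Assumed verdict format, based on the 'Assistant 1' example in this summary.
VERDICT_RE = re.compile(r"Assistant\s*([12])")


def parse_verdict(output: str) -> Optional[int]:
    """Return the last 'Assistant N' mention, or None if there is none.

    Assumption: the reasoning trace may mention both assistants, so the
    last mention is treated as the final decision.
    """
    matches = VERDICT_RE.findall(output)
    return int(matches[-1]) if matches else None


def rule_based_reward(output: str, correct: int) -> float:
    """+1 if the parsed verdict matches the ground-truth label, -1 otherwise."""
    return 1.0 if parse_verdict(output) == correct else -1.0


trace = "Assistant 1 looks right at first... wait, it fails. Final: Assistant 2"
reward = rule_based_reward(trace, correct=2)
```

Only the final label is scored, so the reasoning text itself receives no direct supervision.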

Training Data:

Mixture of Skywork-Reward (80K), Tülu 3 (80K), and custom synthetic data
Custom data: 180K pairs synthesized from WebInstruct, Skywork-OR1, Big-Math-RL, DAPO-Math using DeepSeek-R1-Distill-Qwen-1.5B/7B generators and rule-based verification (Correct vs Incorrect)

Key Hyperparameters:

learning_rate: 1e-6 (32B model), 2e-6 (7B model)
beta: 0.001 (KL penalty)
epsilon: 0.2 (clip ratio)
max_prompt_length: 2048
max_response_length: 2048
optimizer: AdamW

Compute: Trained on AMD Instinct MI300X Accelerators

Comparison to Prior Work

vs. Skywork-Reward: RRM uses explicit reasoning tokens (test-time compute) rather than a direct scalar head
vs. JudgeLM: RRM is trained via RL on ground-truth correctness labels rather than supervised fine-tuning on synthetic explanations
vs. DeepSeek-GRM: RRM demonstrates specific scaling properties (parallel and sequential) and introduces ELO/Knockout strategies for multi-response ranking

Limitations

Computational cost is higher than scalar reward models due to the generation of long reasoning traces
Currently restricted to pairwise inputs; requires tournament strategies for listwise ranking
Reliance on high-quality distillation data from larger models (DeepSeek-R1) for initialization

Reproducibility

Code: https://github.com/allenai/reward-bench

Models are available at https://huggingface.co/Reward-Reasoning. Training framework uses the 'verl' library. Exact training time/cost not reported. Training data mixture ratios provided (5:1:1:1 for 32B).

📊 Experiments & Results

Evaluation Setup

Reward modeling benchmarks (pairwise classification) and downstream application proxy tasks

Benchmarks:

RewardBench: pairwise preference classification (Chat, Reasoning, Safety)
PandaLM Test: pairwise preference classification (Instruction Following)
Preference Proxy Evaluations (PPE): Best-of-N selection on MMLU-Pro, MATH, GPQA

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Results on standard reward modeling benchmarks showing RRM superiority, particularly in reasoning tasks.

Benchmark	Metric	Baseline	This Paper	Δ
RewardBench	Overall Accuracy	86.7	91.9	+5.2
RewardBench	Reasoning Subset Accuracy	95.6	98.6	+3.0
RewardBench	Chat Hard Subset Accuracy	54.2	81.4	+27.2

Performance on Reward-Guided Best-of-N Inference (Preference Proxy Evaluations).

Benchmark	Metric	Baseline	This Paper	Δ
MATH (PPE Best-of-N)	Accuracy	56.9	91.8	+34.9
GPQA (PPE Best-of-N)	Accuracy	44.0	64.3	+20.3

Experiment Figures

Effect of sequential test-time scaling (thinking budget) on RewardBench accuracy.

Main Takeaways

RRMs effectively leverage test-time compute: performance scales monotonically with increased reasoning length (sequential scaling) and majority voting samples (parallel scaling).
Reinforcement learning with simple binary correctness rewards (+1/-1) is sufficient to induce complex reasoning behaviors (reflection, self-correction) without supervised reasoning traces.
Post-training LLMs using RRM-generated labels leads to significant gains on hard benchmarks like Arena-Hard, outperforming even GPT-4o supervision.
The ELO rating strategy for multi-response evaluation consistently outperforms Knockout Tournament, though Tournament is more compute-efficient (O(n) vs O(n^2)).
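
The O(n) vs O(n^2) gap in the last takeaway can be made concrete by counting judge calls. The sketch below assumes the ELO strategy compares every pair of candidates once (a full round-robin, consistent with the quadratic cost stated above) and that the knockout bracket eliminates one candidate per comparison. The function names are illustrative only.

```python
def knockout_comparisons(n: int) -> int:
    """Single elimination: every comparison removes one candidate."""
    return max(n - 1, 0)


def round_robin_comparisons(n: int) -> int:
    """Assumed ELO setup: each unordered pair is compared exactly once."""
    return n * (n - 1) // 2


for n in (4, 8, 16):
    print(n, knockout_comparisons(n), round_robin_comparisons(n))
# prints:
# 4 3 6
# 8 7 28
# 16 15 120
```

At 16 candidates the round-robin needs eight times as many judge calls as the bracket, which is the trade-off against its higher ranking quality.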

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Chain-of-Thought (CoT) Reasoning
Language Models as Judges (LLM-as-a-Judge)

Key Terms

GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that normalizes advantages within a group of samples to stabilize training without a separate value function
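
The group normalization in GRPO can be illustrated with a few lines of Python. This is a simplified sketch of the advantage computation only (no clipping or KL term). The use of the population standard deviation and the `eps` stabilizer are assumptions, since implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds the scalar rewards of all responses sampled for one
    prompt. No learned value function is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids divide-by-zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgments for one prompt, scored with the +1/-1 rule.
advantages = group_relative_advantages([1.0, 1.0, -1.0, -1.0])
```

When every sample in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.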

Chain-of-Thought: A prompting or training technique where the model generates intermediate reasoning steps before producing a final answer

Test-time compute: The amount of computational resources (time, tokens, or parallel samples) used during inference to improve output quality
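
Parallel test-time scaling, as in the voting@5 setting mentioned earlier, can be sketched as a majority vote over independently sampled verdicts. The `sample_verdict` callable is a hypothetical stand-in for one stochastic RRM judgment of the same response pair. The scripted verdicts below exist only to make the example deterministic.

```python
from collections import Counter
from typing import Callable


def majority_vote(sample_verdict: Callable[[], int], k: int = 5) -> int:
    """Sample k verdicts (each 1 or 2) and return the most common one.

    With an odd k and binary verdicts there can be no tie.
    """
    votes = Counter(sample_verdict() for _ in range(k))
    return votes.most_common(1)[0][0]


# Scripted verdicts standing in for five sampled RRM judgments.
scripted = iter([1, 2, 2, 1, 2])
final = majority_vote(lambda: next(scripted), k=5)
```

Increasing k spends more inference compute on the same comparison in exchange for a more stable verdict.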

ELO rating: A rating system calculated from pairwise win/loss records (originally for chess) used here to rank multiple model responses
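
A standard Elo update from pairwise outcomes looks roughly like the sketch below. The K-factor of 32, the initial rating of 1000, and the function names are conventional choices assumed for illustration. The summary does not state which constants the paper uses.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo probability that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(
    names: List[str],
    results: Iterable[Tuple[str, str]],  # (winner, loser) pairs
    k: float = 32.0,         # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating
) -> Dict[str, float]:
    """Apply sequential Elo updates and return the final ratings."""
    ratings = {name: initial for name in names}
    for winner, loser in results:
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


ratings = elo_ratings(["a", "b", "c"], [("a", "b"), ("a", "c"), ("b", "c")])
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Here the (winner, loser) pairs would come from the pairwise judgments, and the candidate with the highest final rating is selected.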

Best-of-N: An inference strategy where N candidate responses are generated, and a reward model selects the best one

Scalar Reward Model: A standard reward model architecture that outputs a single numerical score for a response, usually via a regression head

Generative Reward Model: A reward model that outputs text (like a judge) rather than just a number, allowing for reasoning or explanation

DeepSeek-R1: A strong reasoning model family used as the initialization checkpoint for the RRMs in this paper