Evaluation Setup
Task: pairwise preference prediction on standard reward model benchmarks.
Benchmarks:
- RewardBench (General Chat, Coding, Math, Safety)
- RM-Bench (reasoning-intensive: Math, Code)
- RMB (General Preference)
Metrics:
- Accuracy (Macro Average across subsets)
- Statistical methodology: Not explicitly reported in the paper
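To make the metric concrete, here is a minimal sketch of pairwise preference accuracy with a macro average across benchmark subsets. The subset names and reward scores below are hypothetical; a real evaluation would score actual (prompt, chosen, rejected) triples with the reward model.

```python
def pairwise_accuracy(pairs):
    """Fraction of pairs where the chosen response outscores the rejected one."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

def macro_average(subset_pairs):
    """Unweighted mean of per-subset accuracies: each subset counts equally,
    regardless of how many pairs it contains."""
    accs = [pairwise_accuracy(p) for p in subset_pairs.values()]
    return sum(accs) / len(accs)

# Hypothetical reward scores: (score_chosen, score_rejected) per pair.
subsets = {
    "chat": [(0.9, 0.2), (0.8, 0.6), (0.3, 0.7)],  # 2/3 correct
    "math": [(0.5, 0.1), (0.6, 0.9)],              # 1/2 correct
}
print(round(macro_average(subsets), 3))  # -> 0.583
```

Note the macro average (0.583 here) differs from the pooled micro average (3/5 = 0.6): macro averaging prevents large subsets from dominating the headline number.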
Key Results
RM-R1 models outperform both open-weight and proprietary baselines on average across the three benchmarks (the two Average rows compare against the strongest baseline from each group):

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average (RewardBench, RM-Bench, RMB) | Accuracy | 86.1 | 88.6 | +2.5 |
| Average (RewardBench, RM-Bench, RMB) | Accuracy | 83.7 | 88.6 | +4.9 |
| RM-Bench (Math subset) | Accuracy | 73.0 | 91.8 | +18.8 |
| RM-Bench (Code subset) | Accuracy | 63.0 | 74.1 | +11.1 |
Ablation studies show that the full RM-R1 recipe (Distillation + RL + Rubrics + QC) performs best:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| RewardBench | Accuracy | 88.6 | 90.7 | +2.1 |
| RM-Bench | Accuracy | 59.2 | 72.0 | +12.8 |
Main Takeaways
- Scaling inference compute (the reasoning token budget) improves reward model performance roughly linearly, mirroring the behavior of reasoning models.
- Larger models yield greater performance gains from the reasoning-based training pipeline, supporting a scaling law for ReasRMs.
- Reasoning-based training (distillation + RL) consistently outperforms answer-only SFT, even when controlling for data size.
- The Chain-of-Rubrics mechanism is crucial for bridging the gap between general chat evaluation and rigorous reasoning tasks.