Think Twice: Branch-and-Rethink Reasoning Reward Model

📝 Paper Summary

Reward Modeling, Reinforcement Learning from Human Feedback (RLHF), Reasoning Language Models

BR-RM improves reward modeling by forcing the judge to first select critical evaluation dimensions and then perform a targeted second-pass analysis, reducing the dilution of attention common in single-pass models.

Core Problem

Standard scalar Reward Models (RMs) suffer from 'judgment diffusion': by trying to evaluate all quality criteria in a single pass, they spread attention too thin and fail to catch subtle errors.

Why it matters:

Current judges often miss quiet factual slips or local logic bugs because they lack the depth of focus needed for complex reasoning tasks
Existing generative RMs typically collapse critiques into a single global decision without instance-adaptive focus, retaining the 'all-at-once' pressure of scalar models

Concrete Example: When a Reasoning RM evaluates a complex response, it often allocates tokens evenly across all criteria (e.g., style, safety, correctness). Consequently, it might miss a subtle hallucination in a math derivation because it didn't dedicate specific compute to verifying that single step, resulting in 'shallow analysis'.

Key Novelty

Branch-and-Rethink Reward Model (BR-RM)

Transfers the 'think twice' principle from solvers to judges: instead of one holistic score, the model executes a two-turn generative trace
Turn 1 (Adaptive Branching) identifies specific risks (e.g., 'Check Factuality'); Turn 2 (Rethinking) executes a deep-dive analysis conditioned solely on those flagged risks

Architecture

Comparison of Scalar RM, Generative RM, and the proposed Branch-and-Rethink (BR-RM) frameworks.

Evaluation Highlights

Achieves 85.9 accuracy on RM-Bench with Qwen-14B, setting a new state-of-the-art and outperforming DeepSeek-V3-based judges
BR-RM-Qwen-8B outperforms significantly larger baselines, including GPT-4o and Llama-3.1-70B-Instruct, on the RMB benchmark (70.1 vs 65.6 for GPT-4o)
Ranks top-2 on RMB (74.7 with 14B model), demonstrating superior consistency across reasoning, knowledge, and safety domains compared to scalar RMs

Breakthrough Assessment

8/10

Advances reward modeling by operationalizing 'system 2' thinking for judges. The structured two-turn approach directly targets the attention-dilution problem seen in scalar and single-pass generative RMs.

⚙️ Technical Details

Problem Definition

Setting: Pairwise preference ranking: Given prompt x and responses y1, y2, determine the preferred response z

Inputs: Prompt x, Response Pair (y1, y2)

Outputs: Two-turn reasoning trace τ followed by preference label z

Pipeline Flow

Turn 1: Adaptive Branching (Select Criteria + Sketch Issues)
Turn 2: Branch-Conditioned Rethinking (Deep Analysis + Verdict)
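
The two-turn flow above can be sketched as a small driver around any text generator. This is a minimal illustration, not the authors' implementation: the `generate` callable, the `[Turn 1]` / `[Turn 2]` instruction strings, and the `Verdict: 1|2` output convention are all hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Judgment:
    branching: str   # Turn 1 output: selected criteria + issue sketch
    rethinking: str  # Turn 2 output: focused analysis + verdict
    preferred: int   # 1 or 2, index of the preferred response


def parse_verdict(text: str) -> int:
    # Assumed convention: the Turn 2 text contains "Verdict: 1" or "Verdict: 2".
    match = re.search(r"Verdict:\s*([12])", text)
    if match is None:
        raise ValueError("no verdict found in Turn 2 output")
    return int(match.group(1))


def judge_pair(generate: Callable[[str], str], x: str, y1: str, y2: str) -> Judgment:
    context = f"Prompt: {x}\nResponse 1: {y1}\nResponse 2: {y2}\n"

    # Turn 1: adaptive branching over the prompt and both responses.
    turn1_input = context + "[Turn 1] Select the critical criteria and sketch likely issues.\n"
    branching = generate(turn1_input)

    # Turn 2: rethinking is conditioned on everything from Turn 1.
    turn2_input = (
        turn1_input
        + branching
        + "\n[Turn 2] Re-examine only the flagged criteria, then give a verdict.\n"
    )
    rethinking = generate(turn2_input)

    return Judgment(branching, rethinking, parse_verdict(rethinking))
```

The key design point is that the Turn 2 input contains the Turn 1 output verbatim, so the second pass can only narrow its focus relative to what was flagged.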

System Modules

Adaptive Branching (Turn 1) (Reasoning Generation)

Identify critical evaluation dimensions

Model or implementation: Qwen3-Nemotron (8B/14B)

Branch-Conditioned Rethinking (Turn 2) (Reasoning Generation)

Re-evaluate responses focusing only on selected dimensions

Model or implementation: Qwen3-Nemotron (8B/14B)

Novel Architectural Elements

Two-turn generative dependency: The second generation pass is explicitly conditioned on the 'issue sketch' and 'selected criteria' from the first pass, enforcing a narrowing of scope
Structured trace generation: The model is constrained to output specific sections (Plan, Analysis, Verdict) enforced by format rewards
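
A format check of the kind a format reward would rely on might look like the sketch below. The section names come from the summary above; the exact markers (`Plan:`, `Analysis:`, `Verdict:` at the start of a line, each appearing once and in order) are an assumption made for illustration, since the actual template is not given here.

```python
import re

# Section names from the summary; the "Name:" line-start marker is an assumption.
SECTIONS = ("Plan", "Analysis", "Verdict")


def is_valid_trace(trace: str) -> bool:
    """Return True if every section header appears exactly once, in order."""
    positions = []
    for name in SECTIONS:
        starts = [m.start() for m in re.finditer(rf"^{name}:", trace, flags=re.MULTILINE)]
        if len(starts) != 1:
            return False  # missing or duplicated section
        positions.append(starts[0])
    return positions == sorted(positions)
```

A trace that skips a section, repeats one, or reorders them is rejected, which is the condition under which a format penalty would apply.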

Modeling

Base Model: Qwen3-Nemotron-8B and Qwen3-Nemotron-14B

Training Method: Group Relative Policy Optimization (GRPO)
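
GRPO replaces a learned critic with a comparison among several sampled outputs for the same input. A minimal sketch of that step, using the commonly published formulation (reward minus group mean, divided by group standard deviation): the use of the population standard deviation and the `eps` value are assumptions, and the paper's exact normalization is not stated in this summary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sample's reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When all samples in a group receive the same reward, every advantage is zero and the group contributes no policy gradient signal.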

Objective Functions:

Purpose: Maximize expected reward while staying close to the reference model.
Formally: GRPO objective maximizing E[min(ratio * A, clip(ratio) * A)] - beta * KL(pi_theta || pi_ref), where ratio is the probability ratio between the current and the sampling policy and A is the advantage

Purpose: Enforce correct two-stage output structure.
Formally: R_format = lambda * (is_invalid_format ? 1 : 0), where lambda = -100

Purpose: Reward correct preference prediction.
Formally: R_outcome = (prediction == label) ? 1 : 0 (applied only if format is valid)
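
The format and outcome rewards above combine into a single per-trace reward. The sketch below follows the two formulas as stated; the function and argument names are hypothetical, and it deliberately leaves out the `reward_weight` hyperparameter listed further down, because this summary does not say how that weight enters the reward.

```python
FORMAT_PENALTY = -100.0  # lambda from the summary


def trace_reward(format_valid: bool, prediction: int, label: int,
                 format_penalty: float = FORMAT_PENALTY) -> float:
    """Reward for one sampled trace: format penalty first, outcome second."""
    if not format_valid:
        # R_format: an invalid trace gets the penalty and no outcome reward.
        return format_penalty
    # R_outcome: 1 for a correct preference, 0 otherwise (valid traces only).
    return 1.0 if prediction == label else 0.0
```

Because the penalty is two orders of magnitude larger than the outcome reward, a malformed trace is always worse than a well-formed wrong answer.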

Training Data:

HelpSteer3
Skywork Reward Preference-80K
Code-Preference-Pairs
Math-Step-DPO-10K

Key Hyperparameters:

learning_rate: 5e-7 (8B) / 1e-6 (14B)
optimization_steps: 400
format_penalty_lambda: -100
reward_weight: 10
group_size_K: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-GRM: BR-RM uses a structured two-turn process with strict format enforcement, whereas DeepSeek-GRM uses single-turn free-form generation
vs. EvalPlanner: BR-RM constrains planning to a specific set of criteria and uses Online RL (GRPO) instead of iterative DPO
vs. Standard GenRMs: BR-RM enforces a 'second look' conditioned on the first, rather than a single 'explain-then-score' pass

Limitations

Inference cost is higher than scalar RMs due to long-context generation (two turns of reasoning)
Requires high-quality preference data with reasoning potential to learn the branching strategy effectively
Format constraints are rigid; the model must select from a predefined list of 9 criteria rather than generating novel ones

Reproducibility

Code: https://github.com/yzjiao/BR-RM

Code is publicly available, and model checkpoints are hosted on HuggingFace. The training datasets are public, and training used the NeMo-RL library.

📊 Experiments & Results

Evaluation Setup

Pairwise preference prediction across diverse domains (Chat, Reasoning, Safety, Code)

Benchmarks:

RewardBench (General Reward Modeling (Chat, Safety, Reasoning))
RM-Bench (Reasoning-heavy preference evaluation)
RMB (Comprehensive reward modeling benchmark)

Metrics:

Accuracy (Preference Prediction)
Statistical methodology: Not explicitly reported in the paper
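
For reference, preference-prediction accuracy reduces to the share of pairs where the judge's choice matches the gold label. The helper below is a plain overall accuracy, written as an illustration; individual benchmarks may average over subsets or difficulty levels, which is not modeled here.

```python
from typing import Sequence


def preference_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Percentage of pairs where the predicted preference equals the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == l for p, l in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```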

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on RM-Bench (Reasoning Heavy Benchmark). BR-RM achieves SOTA, surpassing both large scalar models and reasoning models.
RM-Bench	Accuracy	83.9	85.9	+2.0
RM-Bench	Accuracy	73.2	85.9	+12.7
Performance on RMB (Comprehensive Benchmark). BR-RM shows strong generalization, ranking top-2 overall.
RMB	Accuracy	70.5	74.7	+4.2
Performance on RewardBench. While some scalar RMs saturate this benchmark, BR-RM remains highly competitive and balanced.
RewardBench	Accuracy	93.1	94.0	+0.9

Experiment Figures

Analysis of token allocation in existing Reasoning RMs vs BR-RM.

Main Takeaways

Scalar RMs perform well on general chat (RewardBench) but fall behind on reasoning-intensive benchmarks (RM-Bench, RMB), suggesting they rely on shallow heuristics rather than step-level verification.
Reasoning capability scales with size: existing ReasonRMs (like RM-R1) only show gains at 32B+, whereas BR-RM achieves SOTA performance at 14B and strong performance at 8B.
The Branch-and-Rethink strategy allows smaller models (8B) to outperform significantly larger models (GPT-4o, 70B) by allocating compute more efficiently to critical errors.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Chain-of-Thought (CoT) prompting
Proximal Policy Optimization (PPO)

Key Terms

GRPO: Group Relative Policy Optimization, a variant of PPO that estimates advantages using group-relative rewards (comparing multiple samples for the same input) without a separate critic network

judgment diffusion: A failure mode where a model's attention is spread too thinly across many evaluation criteria, preventing deep analysis of any single issue

scalar RM: A traditional reward model that outputs a single numerical score for a response in one forward pass, implicitly aggregating all quality dimensions

GenRM: Generative Reward Model, a judge that produces a text critique or rationale before assigning a score

KL divergence: A statistical measure of how one probability distribution differs from a second, reference probability distribution; used here to keep the trained model from drifting too far from the base model
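
As a concrete illustration of the KL term, the textbook definition for two discrete distributions can be written in a few lines. This is the exact sum over outcomes, not the per-token estimator a training library would typically use; the function name is chosen for the example.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0:
            raise ValueError("KL is undefined when q_i = 0 but p_i > 0")
        total += pi * math.log(pi / qi)
    return total
```

The value is zero when the two distributions match and grows as the trained policy `p` moves away from the reference `q`, which is why subtracting it (scaled by beta) discourages drift.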