Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li
Salesforce AI Research
arXiv (2025)
RL · Reasoning · Factuality
📝 Paper Summary
Reward Modeling · AI Alignment · Reasoning Evaluation
State-of-the-art reward models evaluate the structural coherence and length of reasoning chains rather than valid causal links, evidenced by their inability to distinguish between complete problems and those with the question removed.
Core Problem
Reward models (RMs) are assumed to verify the logical correctness of reasoning, but their internal decision-making mechanisms are opaque; it is unclear if they truly understand the problem or just rely on learned surface patterns.
Why it matters:
RMs are critical for scaling test-time computation and aligning LLMs (RLHF), so their failure to verify causality could enable 'reward hacking', where models generate coherent-looking but incorrect reasoning
If RMs ignore the problem statement, they cannot reliably evaluate novel or complex reasoning tasks where the question details are crucial for the answer
Concrete Example: When the problem statement is completely deleted from the input (leaving only the solution steps), the reward model's score changes very little, indicating it was never actually checking if the solution answered that specific question.
Key Novelty
Consistency-over-Causality Hypothesis
Demonstrates that RMs prioritize 'structural consistency' (does the solution look like a good solution?) over 'causal correctness' (does this solution answer this question?)
Uses systematic input truncation (removing questions, removing steps) to prove RMs rely heavily on the presence of complete reasoning trajectories rather than problem comprehension
Architecture
Systematic error-analysis framework in which different input perturbations (question removal, step truncation, question shuffling, numerical modification) are fed into the reward model to measure deviation from the original score.
Evaluation Highlights
Question truncation (removing the prompt entirely) results in the lowest absolute error in reward scores, implying the question is largely ignored
All-steps truncation (leaving only the final answer) produces the highest errors, showing RMs depend on the 'shape' of reasoning steps to assign value
Modifying numerical values in the question causes significant reward error, indicating RMs check local consistency (do numbers match?) rather than global logic
Breakthrough Assessment
8/10
Provides a crucial negative result about the current state of Reward Modeling, challenging the assumption that RMs act as logical verifiers. The finding that 'questions matter little' is counter-intuitive and significant for the field.
⚙️ Technical Details
Problem Definition
Setting: Evaluation of Reward Model robustness under input perturbations
Inputs: Prompt x (question) and generated response y (reasoning trajectory)
Outputs: Scalar reward score r(x, y) representing the probability of correctness
Pipeline Flow
Input Construction (Standard vs. Truncated/Modified)
Reward Model Inference
Error Analysis
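The three-stage flow above can be sketched as a small driver. A hypothetical `score(question, steps)` callable stands in for the reward model, and the perturbation functions mirror two of the truncation strategies studied in the paper (this is an illustrative sketch, not the authors' code):

```python
def truncate_question(question, steps):
    """Question truncation: delete the problem text, keep all solution steps."""
    return "", steps

def truncate_all_steps(question, steps):
    """All-steps truncation: keep only the final answer line."""
    return question, steps[-1:]

def reward_error(score, question, steps, perturb):
    """Absolute reward deviation |r - r*| induced by one perturbation."""
    r_star = score(question, steps)        # reward on the original input
    r = score(*perturb(question, steps))   # reward on the perturbed input
    return abs(r - r_star)
```

The paper's central observation corresponds to `reward_error(..., truncate_question)` being small while `reward_error(..., truncate_all_steps)` is large.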
System Modules
Input Modifier
Applies perturbations to the (Question, Solution) pair to test RM sensitivity
Model or implementation: Rule-based scripts
Reward Model
Assigns a scalar score to the (possibly modified) input
Model or implementation: Skywork-o1-Open-PRM (1.5B/7B) or Llama-3.1-8B-ORM/PRM
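A rule-based Input Modifier of the kind described could look like the sketch below; the function names and the exact replacement rule are assumptions for illustration, not the paper's scripts:

```python
import random
import re

def modify_numbers(question, seed=0):
    """Numerical modification: shift every integer in the question by a
    random nonzero offset, breaking question/solution number consistency."""
    rng = random.Random(seed)
    return re.sub(r"\d+",
                  lambda m: str(int(m.group()) + rng.randint(1, 9)),
                  question)

def shuffle_questions(questions, seed=0):
    """Question shuffling: re-pair each solution with a question drawn
    from a different problem in the batch."""
    rng = random.Random(seed)
    shuffled = list(questions)
    rng.shuffle(shuffled)
    return shuffled
```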
Modeling
Base Model: the evaluated RMs are Skywork-o1-Open-PRM (based on Qwen-2.5) and Llama-3.1-8B-ORM/PRM
Training Method: The paper analyzes pre-trained reward models; it does not introduce a new training method.
Key Hyperparameters:
temperature: 0.8
top_p: 1.0
num_trajectories: 32 per question
Compute: NVIDIA H100 GPUs using vLLM backend
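The reported hyperparameters suggest a trajectory-sampling setup along the following lines, assuming the vLLM offline-inference API; the policy model name is illustrative, not taken from the paper:

```python
# Config-style sketch only: requires vLLM, GPU, and model weights to run.
from vllm import LLM, SamplingParams

params = SamplingParams(temperature=0.8, top_p=1.0, n=32)  # 32 trajectories per question
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # hypothetical generator model
outputs = llm.generate(["<math question>"], params)
```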
Comparison to Prior Work
vs. ODIN: This paper diagnoses the 'consistency vs. causality' issue rather than proposing a mitigation technique like ODIN's length disentanglement
vs. Standard Evaluation: Standard RM benchmarks check accuracy on hold-out sets; this paper checks robustness to semantic destruction (removing the question)
Limitations
Analysis is limited to correctness-based reasoning tasks (math) and may not apply to creative writing or safety alignment
Does not propose a new training method to fix the identified causality issues
Relies on specific open-source RMs (Skywork, Llama-3.1); behavior might differ in proprietary closed-source models (e.g., GPT-4)
Reproducibility
The paper uses publicly available datasets (GSM8K, MATH500, etc.) and open-weights models (Skywork-o1-Open-PRM, Llama-3, Qwen-2.5). The snippet does not indicate whether code is released.
📊 Experiments & Results
Evaluation Setup
Perturbation analysis on mathematical reasoning tasks
Benchmarks:
GSM8K (Grade school math reasoning)
MATH500 (Challenging math problems)
OlympiadBench (Competition-level math)
GaoKao-2023-En (Chinese college entrance exam, English translation)
Minerva Math (Mathematical reasoning)
Metrics:
Absolute Error |r - r*| (difference between reward for original vs. modified input)
Spearman’s rank correlation coefficient (ρ)
Statistical methodology: Not explicitly reported in the paper
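Under the metric definitions above, both statistics can be computed with a few lines of stdlib Python; Spearman's ρ is shown via the rank-difference formula, ignoring ties for simplicity:

```python
def absolute_errors(original, perturbed):
    """Per-sample |r - r*| between original and perturbed reward scores."""
    return [abs(p - o) for o, p in zip(original, perturbed)]

def spearman_rho(xs, ys):
    """Spearman's rank correlation (no tie correction, for illustration)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```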
Experiment Figures
Histograms of absolute reward error |r - r*| for different truncation strategies (Question, Initial Steps, Last Steps, All Steps) across datasets.
Error distributions for consistency checks: Question Shuffling and Numerical Value Modification.
Main Takeaways
Question Truncation (removing the problem text) has the least impact on reward scores, suggesting RMs do not strongly condition on the question to verify the solution.
All-steps truncation (leaving only the answer) causes the highest error, indicating intermediate reasoning steps are the primary signal RMs use to judge quality (likely relying on length or pattern matching).
Modifying numerical values in the question disrupts rewards significantly, showing that while RMs ignore the *semantic* question, they are sensitive to *token-level* inconsistencies between question numbers and solution numbers.
Incomplete trajectories (truncated reasoning) cause large reward drops, implying RMs have learned to recognize 'complete-looking' reasoning patterns rather than validating step-by-step logic.
📚 Prerequisite Knowledge
Prerequisites
Understanding of RLHF (Reinforcement Learning from Human Feedback)
Familiarity with Reward Models (ORMs and PRMs)
Basic knowledge of LLM reasoning tasks (GSM8K, MATH)
Key Terms
RM: Reward Model—a model trained to score the quality or correctness of an LLM's output, used to guide generation
ORM: Outcome Reward Model—evaluates the solution based solely on the final result/answer
PRM: Process Reward Model—evaluates the solution step-by-step, providing granular feedback on reasoning
Best-of-N: An inference strategy where the model generates N solutions and the one with the highest Reward Model score is selected
Reward Hacking: When a model exploits flaws in the reward function to get a high score without actually achieving the intended goal (e.g., writing long but empty text)
RLHF: Reinforcement Learning from Human Feedback—a method to align LLMs using a reward model trained on human preferences
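For reference, Best-of-N selection reduces to an argmax over reward-model scores; `score` below is a hypothetical stand-in for the RM interface:

```python
def best_of_n(question, candidates, score):
    """Return the candidate solution the reward model scores highest."""
    return max(candidates, key=lambda y: score(question, y))
```

Note how this connects to reward hacking: if `score` secretly favors length or 'complete-looking' structure rather than causal correctness, Best-of-N will systematically select such outputs.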