Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li
Salesforce AI Research
arXiv (2025)
RL · Reasoning · Factuality
📝 Paper Summary
Reward Modeling · AI Alignment · Reasoning Evaluation
State-of-the-art reward models evaluate the structural coherence and length of reasoning chains rather than valid causal links, evidenced by their inability to distinguish between complete problems and those with the question removed.
Core Problem
Reward models (RMs) are assumed to verify the logical correctness of reasoning, but their internal decision-making mechanisms are opaque; it is unclear if they truly understand the problem or just rely on learned surface patterns.
Why it matters:
RMs are critical for scaling test-time computation and aligning LLMs (RLHF), so their failure to verify causality could enable 'reward hacking', where models generate coherent-looking but incorrect reasoning
If RMs ignore the problem statement, they cannot reliably evaluate novel or complex reasoning tasks where the question details are crucial for the answer
Concrete Example: When the problem statement is completely deleted from the input (leaving only the solution steps), the reward model's score changes very little, indicating it was never actually checking if the solution answered that specific question.
Key Novelty
Consistency-over-Causality Hypothesis
Demonstrates that RMs prioritize 'structural consistency' (does the solution look like a good solution?) over 'causal correctness' (does this solution answer this question?)
Uses systematic input truncation (removing questions, removing steps) to prove RMs rely heavily on the presence of complete reasoning trajectories rather than problem comprehension
Architecture
Systematic error-analysis framework in which different input perturbations (question removal, step truncation, question shuffling, numerical modification) are fed into the reward model to measure deviation from the original score.
Evaluation Highlights
Question truncation (removing the prompt entirely) results in the lowest absolute error in reward scores, implying the question is largely ignored
All-steps truncation (leaving only the final answer) produces the highest errors, showing RMs depend on the 'shape' of reasoning steps to assign value
Modifying numerical values in the question causes significant reward error, indicating RMs check local consistency (do numbers match?) rather than global logic
Breakthrough Assessment
8/10
Provides a crucial negative result about the current state of Reward Modeling, challenging the assumption that RMs act as logical verifiers. The finding that 'questions matter little' is counter-intuitive and significant for the field.
⚙️ Technical Details
Problem Definition
Setting: Evaluation of Reward Model robustness under input perturbations
Inputs: Prompt x (question) and generated response y (reasoning trajectory)
Outputs: Scalar reward score r(x, y) representing the probability of correctness
Pipeline Flow
Input Construction (Standard vs. Truncated/Modified)
Reward Model Inference
Error Analysis
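The three-stage flow above can be sketched as a small driver. A hypothetical `score(question, steps)` callable stands in for the reward model, and the perturbation functions mirror two of the truncation strategies studied in the paper (this is an illustrative sketch, not the authors' code):

```python
def truncate_question(question, steps):
    """Question truncation: delete the problem text, keep all solution steps."""
    return "", steps

def truncate_all_steps(question, steps):
    """All-steps truncation: keep only the final answer line."""
    return question, steps[-1:]

def reward_error(score, question, steps, perturb):
    """Absolute reward deviation |r - r*| induced by one perturbation."""
    r_star = score(question, steps)        # reward on the original input
    r = score(*perturb(question, steps))   # reward on the perturbed input
    return abs(r - r_star)
```

The paper's central observation corresponds to `reward_error(..., truncate_question)` being small while `reward_error(..., truncate_all_steps)` is large.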
System Modules
Input Modifier
Applies perturbations to the (Question, Solution) pair to test RM sensitivity
Model or implementation: Rule-based scripts
Reward Model
Assigns a scalar score to the (possibly modified) input
Model or implementation: Skywork-o1-Open-PRM (1.5B/7B) or Llama-3.1-8B-ORM/PRM
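A rule-based Input Modifier of the kind described could look like the sketch below; the function names and the exact replacement rule are assumptions for illustration, not the paper's scripts:

```python
import random
import re

def modify_numbers(question, seed=0):
    """Numerical modification: shift every integer in the question by a
    random nonzero offset, breaking question/solution number consistency."""
    rng = random.Random(seed)
    return re.sub(r"\d+",
                  lambda m: str(int(m.group()) + rng.randint(1, 9)),
                  question)

def shuffle_questions(questions, seed=0):
    """Question shuffling: re-pair each solution with a question drawn
    from a different problem in the batch."""
    rng = random.Random(seed)
    shuffled = list(questions)
    rng.shuffle(shuffled)
    return shuffled
```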
Modeling
Base Model: the evaluated RMs are Skywork-o1-Open-PRM (based on Qwen-2.5) and Llama-3.1-8B-ORM/PRM
Training Method: The paper analyzes pre-trained reward models; it does not introduce a new training method.
Key Hyperparameters:
temperature: 0.8
top_p: 1.0
num_trajectories: 32 per question
Compute: NVIDIA H100 GPUs using vLLM backend
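The reported hyperparameters suggest a trajectory-sampling setup along the following lines, assuming the vLLM offline-inference API; the policy model name is illustrative, not taken from the paper:

```python
# Config-style sketch only: requires vLLM, GPU, and model weights to run.
from vllm import LLM, SamplingParams

params = SamplingParams(temperature=0.8, top_p=1.0, n=32)  # 32 trajectories per question
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # hypothetical generator model
outputs = llm.generate(["<math question>"], params)
```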
Comparison to Prior Work
vs. ODIN: This paper diagnoses the 'consistency vs. causality' issue rather than proposing a mitigation technique like ODIN's length disentanglement
vs. Standard Evaluation: Standard RM benchmarks check accuracy on hold-out sets; this paper checks robustness to semantic destruction (removing the question)
Limitations
Analysis is limited to correctness-based reasoning tasks (math) and may not apply to creative writing or safety alignment
Does not propose a new training method to fix the identified causality issues
Relies on specific open-source RMs (Skywork, Llama-3.1); behavior might differ in proprietary closed-source models (e.g., GPT-4)
Reproducibility
The paper uses publicly available datasets (GSM8K, MATH500, etc.) and open-weights models (Skywork-o1-Open-PRM, Llama-3, Qwen-2.5). The snippet does not indicate whether code is released.
📊 Experiments & Results
Evaluation Setup
Perturbation analysis on mathematical reasoning tasks
Benchmarks:
GSM8K (Grade school math reasoning)
MATH500 (Challenging math problems)
OlympiadBench (Competition-level math)
GaoKao-2023-En (Chinese college entrance exam, English translation)
Minerva Math (Mathematical reasoning)
Metrics:
Absolute Error |r - r*| (difference between reward for original vs. modified input)
Spearman’s rank correlation coefficient (ρ)
Statistical methodology: Not explicitly reported in the paper
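Under the metric definitions above, both statistics can be computed with a few lines of stdlib Python; Spearman's ρ is shown via the rank-difference formula, ignoring ties for simplicity:

```python
def absolute_errors(original, perturbed):
    """Per-sample |r - r*| between original and perturbed reward scores."""
    return [abs(p - o) for o, p in zip(original, perturbed)]

def spearman_rho(xs, ys):
    """Spearman's rank correlation (no tie correction, for illustration)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```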
Experiment Figures
Histograms of absolute reward error |r - r*| for different truncation strategies (Question, Initial Steps, Last Steps, All Steps) across datasets.
Error distributions for consistency checks: Question Shuffling and Numerical Value Modification.
Main Takeaways
Question Truncation (removing the problem text) has the least impact on reward scores, suggesting RMs do not strongly condition on the question to verify the solution.
All-steps truncation (leaving only the answer) causes the highest error, indicating intermediate reasoning steps are the primary signal RMs use to judge quality (likely relying on length or pattern matching).
Modifying numerical values in the question disrupts rewards significantly, showing that while RMs ignore the *semantic* question, they are sensitive to *token-level* inconsistencies between question numbers and solution numbers.
Incomplete trajectories (truncated reasoning) cause large reward drops, implying RMs have learned to recognize 'complete-looking' reasoning patterns rather than validating step-by-step logic.
📚 Prerequisite Knowledge
Prerequisites
Understanding of RLHF (Reinforcement Learning from Human Feedback)
Familiarity with Reward Models (ORMs and PRMs)
Basic knowledge of LLM reasoning tasks (GSM8K, MATH)
Key Terms
RM: Reward Model—a model trained to score the quality or correctness of an LLM's output, used to guide generation
ORM: Outcome Reward Model—evaluates the solution based solely on the final result/answer
PRM: Process Reward Model—evaluates the solution step-by-step, providing granular feedback on reasoning
Best-of-N: An inference strategy where the model generates N solutions and the one with the highest Reward Model score is selected
Reward Hacking: When a model exploits flaws in the reward function to get a high score without actually achieving the intended goal (e.g., writing long but empty text)
RLHF: Reinforcement Learning from Human Feedback—a method to align LLMs using a reward model trained on human preferences
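For reference, Best-of-N selection reduces to an argmax over reward-model scores; `score` below is a hypothetical stand-in for the RM interface:

```python
def best_of_n(question, candidates, score):
    """Return the candidate solution the reward model scores highest."""
    return max(candidates, key=lambda y: score(question, y))
```

Note how this connects to reward hacking: if `score` secretly favors length or 'complete-looking' structure rather than causal correctness, Best-of-N will systematically select such outputs.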