
PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering

Xiangfeng Wang, Hangyu Guo, Yanlin Lai, Mitt Huang, Liang Zhao, Chengyuan Yao, Yinmin Zhang, Qi Han, Xiaoxiao Ren, Chun Yuan, Tong Xu, Zheng Ge, Xiangyu Zhang, Daxin Jiang
University of Science and Technology of China, StepFun, Tsinghua University
arXiv (2026)
Benchmark Reasoning RL Factuality

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR), Mathematical Reasoning, Reward Modeling / Verification
PRIME is a benchmark of 2,530 expert-annotated STEM problems designed to penalize 'lucky guesses' (correct answers reached through flawed reasoning), and it demonstrates that process-aware verifiers substantially boost downstream RLVR performance.
Core Problem
Current outcome-centric verifiers check only if the final answer matches the ground truth, failing to detect flawed derivations that coincidentally yield the correct result.
Why it matters:
  • Reinforcement Learning with Verifiable Rewards (RLVR) relies on accurate reward signals; rewarding 'lucky guesses' reinforces incorrect reasoning patterns
  • Rule-based verifiers struggle with flexible output formats, while existing model-based benchmarks neglect the derivation process
  • Approximately 17% of correct model responses in STEM tasks are actually 'lucky guesses' with flawed logic, which outcome-only verifiers miss
Concrete Example: A model calculates the area of a circle with radius $r=2$. It incorrectly uses the circumference formula $2\pi r$ instead of area $\pi r^2$. Coincidentally, $2\pi(2) = 4\pi$ and $\pi(2^2) = 4\pi$. An outcome-only verifier marks this 'Correct', reinforcing the wrong formula, while PRIME's process-aware verifier rejects it.
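The contrast between the two verifier types can be sketched in a few lines of Python. This is an illustrative toy, not PRIME's actual implementation: the function names are invented, and the process check is reduced to a naive string match on the derivation steps.

```python
import math

def outcome_only_verify(final_answer: float, ground_truth: float) -> bool:
    """Outcome-only verifier: compares only the final numeric answer."""
    return math.isclose(final_answer, ground_truth)

def process_aware_verify(steps: list[str], final_answer: float,
                         ground_truth: float) -> bool:
    """Toy process-aware verifier: the answer must match AND the
    derivation must use the area formula pi*r**2 (naive string check,
    standing in for a real step-level logical check)."""
    used_area_formula = any("pi * r**2" in s for s in steps)
    return math.isclose(final_answer, ground_truth) and used_area_formula

r = 2.0
ground_truth = math.pi * r**2        # correct area: 4*pi

# Flawed derivation: uses the circumference formula 2*pi*r,
# which coincidentally also equals 4*pi when r = 2.
flawed_steps = ["area = 2 * pi * r"]
flawed_answer = 2 * math.pi * r      # 4*pi as well

print(outcome_only_verify(flawed_answer, ground_truth))               # True: lucky guess passes
print(process_aware_verify(flawed_steps, flawed_answer, ground_truth))  # False: flawed process rejected
```

An outcome-only reward would reinforce the circumference formula here; the process-aware check withholds the reward despite the matching answer.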
Key Novelty
Process-Outcome Alignment Benchmark (PRIME)
  • Constructs a dataset of hard-to-verify STEM problems where models often get the right answer for the wrong reasons
  • Uses a 'Consensus Score' filtering mechanism to select only samples where proxy verifiers disagree, ensuring the benchmark targets the decision boundary
  • Validates verifiers not just on answer extraction, but on their ability to enforce logical consistency between the step-by-step derivation and the final result
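The filtering idea above can be sketched as follows. This is a minimal sketch assuming the Consensus Score is the fraction of proxy verifiers agreeing with the majority verdict and that a fixed threshold selects low-consensus samples; the paper's exact scoring rule and threshold may differ.

```python
def consensus_score(verdicts: list[bool]) -> float:
    """Fraction of proxy verifiers agreeing with the majority verdict.
    1.0 = unanimous; values near 0.5 mean maximal disagreement."""
    yes = sum(verdicts)
    return max(yes, len(verdicts) - yes) / len(verdicts)

def select_hard_samples(samples: list[tuple[str, list[bool]]],
                        threshold: float = 0.8) -> list[str]:
    """Keep only samples whose proxy verifiers disagree (consensus below
    the threshold), i.e. those near the verification decision boundary."""
    return [sample_id for sample_id, verdicts in samples
            if consensus_score(verdicts) < threshold]

samples = [
    ("easy_problem", [True, True, True, True, True]),    # unanimous: dropped
    ("hard_problem", [True, True, False, True, False]),  # contested: kept
]
print(select_hard_samples(samples))  # ['hard_problem']
```

Filtering on disagreement concentrates the benchmark on cases where verification is genuinely hard, rather than on samples any verifier gets right.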
Evaluation Highlights
  • +9.12% absolute accuracy gain on AIME 2025 for Qwen3-14B-Base when trained with a process-aware verifier selected via PRIME compared to an outcome-only baseline
  • Strong linear correlation (R² > 0.92) between a verifier's accuracy on PRIME and the downstream performance improvement of the RLVR-trained model
  • Identifies that ~17% of 'correct' answers in raw STEM generation are actually 'lucky guesses' with flawed reasoning
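The R² figure above is the coefficient of determination of a simple linear fit of downstream RLVR gain against verifier accuracy on PRIME. For a one-variable least-squares fit, R² equals the squared Pearson correlation, which can be computed directly; this helper is illustrative, not taken from the paper.

```python
def r_squared(x: list[float], y: list[float]) -> float:
    """R^2 of a simple linear least-squares fit of y on x,
    computed as the squared Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# A perfectly linear relationship yields R^2 = 1.0.
print(r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
```

An R² above 0.92 on this fit means a verifier's PRIME score is a strong predictor of how much it helps the RLVR-trained model.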
Breakthrough Assessment
9/10
Addresses a critical, often-overlooked failure mode in reasoning (spurious correctness). The strong correlation (R² > 0.92) between benchmark score and downstream training gain validates it as a high-utility tool for the community.