RLVR: Reinforcement Learning with Verifiable Rewards—a training method where models improve by receiving feedback based on the objective correctness of their answers
Lucky Guess: A scenario where a reasoning model arrives at the correct final answer despite using incorrect logic, formulas, or derivation steps
Consensus Score: A metric used during dataset construction, defined as the average agreement rate of a proxy verifier across multiple trials; samples with low consensus are flagged as 'Hard-to-Verify'
Process-Outcome Alignment: The requirement that a correct final answer must be the result of a logically valid derivation process
AIME: American Invitational Mathematics Examination—a challenging math competition used as a benchmark for advanced reasoning capabilities
GPQA: Graduate-Level Google-Proof Q&A—a difficult graduate-level science benchmark whose questions are designed to resist answering via simple web search
SFT: Supervised Fine-Tuning—training a model on a dataset of correct examples before applying reinforcement learning
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm used to update the model's policy based on reward signals
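The RLVR reward and Consensus Score entries above can be illustrated with a minimal sketch. The function names, the exact-match check, the boolean voting scheme, and the 0.7 threshold below are all illustrative assumptions, not details from the source:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """RLVR-style reward: 1.0 if the final answer matches the verifiable
    ground truth, else 0.0. (Exact string matching is an assumption; real
    verifiers may normalize or check symbolic equivalence.)"""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0


def consensus_score(verifier_votes: list[bool]) -> float:
    """Average agreement rate of a proxy verifier across multiple trials:
    the fraction of trials in which the verifier accepted the answer."""
    return sum(verifier_votes) / len(verifier_votes)


def is_hard_to_verify(verifier_votes: list[bool],
                      threshold: float = 0.7) -> bool:
    """Flag a sample as 'Hard-to-Verify' when the proxy verifier's
    agreement falls below a threshold (0.7 is purely illustrative)."""
    return consensus_score(verifier_votes) < threshold
```

For example, a sample on which the verifier agrees in only 2 of 4 trials has a consensus score of 0.5 and would be flagged as Hard-to-Verify under this sketch.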