RLVR: Reinforcement Learning with Verifiable Rewards—RL where rewards are determined by programmatic checks (e.g., correct math answer, passing unit tests).
Outcome Verifier: A deterministic function that checks if a model's final answer matches the ground truth.
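A minimal sketch of an outcome verifier for math answers, assuming (hypothetically) that the model marks its final answer with `\boxed{...}`; the extraction convention and function name are illustrative, not a specific system's API:

```python
import re

def outcome_verifier(response: str, ground_truth: str) -> float:
    """Deterministic check: extract the final boxed answer from the
    model's response and compare it to the ground truth string.
    Returns a binary reward (1.0 correct, 0.0 otherwise)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```

Because the check is purely programmatic, the reward depends only on the final answer, not on the reasoning that produced it.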
PPO: Proximal Policy Optimization—a policy gradient RL algorithm that prevents drastic policy updates using a clipped objective function.
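The clipped objective can be sketched as follows; this is a simplified NumPy version operating on per-token log-probabilities and advantages, not a full PPO implementation:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss.
    ratio = pi_new(a|s) / pi_old(a|s), computed in log space for stability.
    Clipping the ratio to [1 - eps, 1 + eps] caps how far a single
    update can move the policy from the one that collected the data."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    advantages = np.asarray(advantages)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The objective is maximized, so the loss is its negation.
    return -np.minimum(unclipped, clipped).mean()
```

Taking the elementwise minimum of the clipped and unclipped terms makes the objective pessimistic: large policy shifts get no extra credit.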
Online RL: Training where the model updates its policy based on data it generates in real time during training, rather than on a static, pre-collected dataset.
GAE: Generalized Advantage Estimation—a method to estimate the advantage function (how good an action is) by balancing bias and variance.
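The bias-variance trade-off in GAE is controlled by the decay parameter lambda, as in this sketch (variable names are illustrative; `values` is assumed to include a bootstrap value for the state after the last step):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)       (TD error)
    A_t     = sum_l (gamma * lam)^l * delta_{t+l}     (discounted sum)
    lam=0 gives one-step TD (high bias, low variance);
    lam=1 gives Monte Carlo returns minus V (low bias, high variance)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Accumulate the discounted sum of TD errors backward in time.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```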
Superficial self-reflection: A failure mode where models appear to critique their work but lack the genuine ability to identify errors, often verifying based on surface features rather than logic.
Pass@k: A metric measuring the probability that at least one correct solution is generated within k attempts.
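Naively computing pass@k from k samples is high-variance; the commonly used unbiased estimator (popularized by the Codex paper) draws n >= k samples, counts c correct ones, and computes 1 - C(n-c, k)/C(n, k), i.e., one minus the probability that a size-k subset contains no correct solution:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from n samples with c correct.
    C(n-c, k)/C(n, k) is the chance that k draws (without replacement)
    are all incorrect; pass@k is its complement."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k all-wrong draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```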