Reinforcing General Reasoning without Verifiers

📝 Paper Summary

Reinforcement Learning for Reasoning LLM Reasoning Verifier-Free Training

VeriFree extends DeepSeek-R1-Zero-style reasoning training to general domains by replacing external verifiers with a likelihood-based objective that treats the correct answer as a probabilistic target given a generated chain of thought.

Core Problem

DeepSeek-R1-Zero-style reinforcement learning relies on rule-based verifiers (valid for math/code), but cannot extend to general domains (law, chemistry) where verification is hard, often forcing reliance on slow, exploitable model-based verifiers.

Why it matters:

Rule-based verification is impossible for many real-world tasks (e.g., 'What is the capital of France?' cannot be executed like code).
Using a separate LLM as a verifier (reward model) is computationally expensive, memory-intensive, and prone to reward hacking.
Existing variational approaches for latent reasoning (like JLB or LaTRO) underperform standard R1-Zero methods.

Concrete Example: In a math problem, a rule-based verifier checks if 'boxed{8/5}' equals '1.6'. In a chemistry question asking 'Which element...', no simple rule exists. Standard approaches would require a heavy second LLM to judge correctness, whereas VeriFree computes the probability of the ground truth token directly from the policy.

Key Novelty

Verifier-Free (VeriFree) Policy Optimization

Treats the reasoning trace as a latent variable and optimizes the policy to maximize the likelihood of the ground-truth answer given that trace.
Derives a gradient estimator that is mathematically equivalent to standard RL with a perfect verifier (assuming a unique correct answer) but has analytically lower variance via Rao-Blackwellization.
Eliminates the need for a separate reward model or rule-based checker by using the policy itself to score the consistency between its reasoning and the reference answer.

Architecture

Comparison of Standard RLHF/RLVR pipeline versus the proposed VeriFree pipeline.

Evaluation Highlights

Outperforms verifier-based RL baselines on general reasoning benchmarks: +3.0% accuracy on MMLU-Pro compared to RL with a learnt verifier.
Matches or exceeds performance on math tasks where verifiers exist: +1.2% on MATH benchmark compared to standard R1-Zero style training.
Achieves these gains with significantly reduced compute and memory overhead by removing the need for an external verifier model during training.

Breakthrough Assessment

8/10

Elegantly solves the 'verification bottleneck' for general reasoning. By proving equivalence to RLVR with lower variance, it enables R1-style reasoning gains in non-formal domains without the cost/fragility of reward models.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning on Language Models for reasoning tasks

Inputs: Input question x

Outputs: Reasoning trace z and final answer y

Pipeline Flow

Input Processing: Question x
Reasoning Generation: Model generates trace z (stopped before answer)
Gradient Estimation: Compute likelihood of reference answer y* given z
Optimization: Update policy parameters

System Modules

Policy Model (LLM)

Generates the reasoning trace z conditioned on input x.

Model or implementation: Qwen-2.5-7B-Instruct / Llama-3.1-8B-Instruct

Likelihood Evaluator

Calculates the probability of the ground truth answer y* given the generated trace z.

Model or implementation: Same Policy Model (Self-evaluation)

Novel Architectural Elements

Integration of implicit verification: The policy model itself acts as the verifier by computing the likelihood of the correct answer, removing the external reward model module entirely.
Reasoning-Answer Patching: A specific tokenization handling strategy where generation stops exactly at '<answer' to allow seamless concatenation of the reference answer for likelihood computation.

Modeling

Base Model: Qwen-2.5-7B-Instruct and Llama-3.1-8B-Instruct

Training Method: VeriFree (Variance-reduced Policy Gradient)

Objective Functions:

Purpose: Maximize expected probability of correct answer by reinforcing good reasoning traces.

Formally: ∇J ≈ E[ (R_i - Baseline) * ∇ log π(z|x) + ∇ log π(y*|x,z) ] where R_i = π(y*|x,z).

Key Hyperparameters:

learning_rate: 5e-7 (Qwen), 1e-6 (Llama)
global_batch_size: 256
rollout_n: 8 (number of samples per prompt)
+ 3 more
kl_coefficient: 0.01 (Qwen), 0.005 (Llama)
max_new_tokens: 2048
prompt_max_length: 1024

Compute: Requires significantly less memory than verifier-based methods (which hold a separate 7B+ verifier in memory). Trained on H800 GPUs.

Comparison to Prior Work

vs. DeepSeek-R1-Zero: VeriFree works on general domains (no rule-based verifier needed) and has lower variance.
vs. Model-based Verifiers (RLHF style): VeriFree requires no separate reward model in memory and avoids reward hacking against a frozen proxy.
vs. JLB / LaTRO: VeriFree matches/outperforms verifier methods, whereas JLB/LaTRO historically underperform due to objective mismatch (e.g., JLB uses log-prob as reward, VeriFree uses probability).
+ 1 more
vs. STaR [not cited in paper]: STaR filters samples via correctness (binary) for SFT; VeriFree uses soft probabilistic rewards for RL updates.

Limitations

Assumption of single correct answer in derivation (though shown robust to multiple valid answers in practice).
Relies on the presence of ground truth answers during training (cannot learn from unlabelled data).
Requires careful handling of tokenization at the boundary between reasoning trace and reference answer.

Reproducibility

Code: https://github.com/sail-sg/VeriFree

Code is publicly available at https://github.com/sail-sg/VeriFree. Hyperparameters for Qwen and Llama models are detailed in the paper. The method relies on standard RL frameworks (GRPO/RLOO) but modifies the gradient estimator.

📊 Experiments & Results

Evaluation Setup

Greedy decoding for final evaluation; RL training on diverse reasoning tasks.

Benchmarks:

MMLU-Pro (General Reasoning (Multi-choice))
GPQA (Graduate-Level Science QA)
SuperGPQA (Hard Science QA)
MATH (Mathematical Problem Solving)
GSM8K (Grade School Math)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
VeriFree consistently outperforms baselines on general reasoning benchmarks (MMLU-Pro, GPQA) using Qwen-2.5-7B-Instruct.
MMLU-Pro	Accuracy	53.6	56.6	+3.0
GPQA	Accuracy	39.7	45.1	+5.4
On verifiable domains (Math), VeriFree matches or slightly exceeds the standard R1-Zero (Rule-based Verifier) approach.
MATH	Accuracy	69.0	70.2	+1.2
GSM8K	Accuracy	88.6	88.8	+0.2

Experiment Figures

Radar chart comparing VeriFree against SFT and Verification-based RL baselines across 5 benchmarks.

Main Takeaways

VeriFree achieves superior performance on general reasoning tasks compared to using a separate LLM as a verifier, avoiding reward hacking and reducing compute.
The method matches or exceeds standard Rule-based RL (R1-Zero) on math benchmarks, proving it is a robust generalization of the RLVR paradigm.
Empirically robust to 'semantic equivalence' issues: training with a single reference answer works well even when multiple valid formats exist.
Ablations show that weighting the supervised term by the reward (likelihood) is crucial; removing this weighting (as done in JLB/LaTRO) degrades performance.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
Chain of Thought (CoT) prompting
DeepSeek-R1-Zero training paradigm
Variational Inference

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—training method where rewards are binary (correct/incorrect) based on a rule-based checker (e.g., for math or code).

R1-Zero-style training: A paradigm introduced by DeepSeek-R1-Zero that uses RLVR to induce reasoning capabilities (Chain of Thought) without supervised demonstrations.

CoT: Chain of Thought—intermediate reasoning steps generated by the model before the final answer.

GRPO: Group Relative Policy Optimization—a simplified RL algorithm (variant of PPO) used in DeepSeek-R1-Zero that normalizes advantages within a group of samples.

RLOO: Reinforce Leave-One-Out—a variance reduction technique for policy gradients where the baseline for a sample is the average reward of other samples in the same batch.

Rao-Blackwellization: A statistical technique to reduce the variance of an estimator by taking its expectation conditioned on a sufficient statistic (here, marginalizing out the final answer y).

JLB: Jensen's Lower Bound—a variational lower bound objective used in prior work like Tang et al. [40] for latent reasoning.

LaTRO: Latent Reasoning Optimization—another variational approach (Chen et al. [4]) using a fixed reference policy.

Policy Gradient: An optimization technique where the model's parameters are updated to increase the probability of actions that yield high rewards.

PPO: Proximal Policy Optimization—a standard RL algorithm that prevents the policy from changing too drastically in one step.