VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) Post-training for Reasoning

VI-CuRL stabilizes verifier-free reinforcement learning by using the model's intrinsic confidence to select easier, low-variance samples early in training, progressively introducing harder ones.

Core Problem

Verifier-free RL for reasoning suffers from destructive gradient variance because there are no external signals to prune incorrect trajectories, leading to training collapse.

Why it matters:

Standard RLVR relies on ground-truth verifiers (e.g., math answer checkers or unit tests), which are expensive or impossible to obtain for open-ended tasks like creative writing or general logic
Without verifiers, the variance in gradients becomes unmanageably large, causing models to unlearn capabilities rather than improve
Existing curriculum methods still depend on verifier signals, so they do not address the core issue in truly verifier-independent settings

Concrete Example: In verifier-free scenarios, standard Group Relative Policy Optimization (GRPO) treats all prompts equally. For a hard prompt where the model has high entropy (uncertainty), the sampled trajectories vary wildly in quality but lack reliable ground-truth feedback, resulting in noisy gradients that destabilize the policy.

Key Novelty

Intrinsic Confidence-Based Curriculum for RL

Uses the model's own 'confidence' (length-normalized negative entropy) as a proxy for sample difficulty, independent of external verifiers
Dynamically filters training batches to keep only high-confidence samples early on, ensuring gradients come from 'easy' problems where the model is decisive
Gradually anneals the filtering threshold to include harder samples, provably ensuring the surrogate objective converges to the true unbiased objective over time

Architecture

Figure: overview of the VI-CuRL framework, contrasted with standard RL.

Evaluation Highlights

Consistently outperforms verifier-independent baselines across six logic and math benchmarks (e.g., GSM8K, MATH, logical deduction)
Achieves performance competitive with oracle-verified methods (RLVR with ground truth) without using any external verifier during training
Prevents the 'training collapse' observed in standard GRPO baselines, maintaining stability throughout the optimization process

Breakthrough Assessment

8/10

Offers a theoretically grounded and empirically effective solution to a critical bottleneck in RLHF/RLVR—dependence on external verifiers—making post-training scalable to open-ended domains.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning on reasoning tasks without access to ground-truth verifiers or reward models during training

Inputs: Prompt x drawn from data distribution p(x)

Outputs: Reasoning trajectory y (sequence of tokens)

Pipeline Flow

Prompt Sampling
Confidence Estimation
Curriculum Filtering
Policy Optimization

System Modules

Policy Model

Generate reasoning trajectories and compute log-probabilities

Model or implementation: LLM (architecture not specified, likely standard Transformer)

Confidence Estimator

Calculate intrinsic confidence score for the prompt based on generation entropy

Model or implementation: Statistical aggregation
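
A minimal sketch of how such a confidence score could be computed, assuming confidence is minus the mean per-token entropy of the policy's next-token distributions. The function names and the prompt-level averaging over sampled trajectories are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def token_entropies(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of each row of a (T, V) array of next-token distributions."""
    safe = np.clip(probs, 1e-12, 1.0)  # avoid log(0); zero-probability entries contribute 0
    return -(probs * np.log(safe)).sum(axis=-1)

def sequence_confidence(probs: np.ndarray) -> float:
    """Length-normalized negative entropy: minus the mean per-token entropy."""
    return float(-token_entropies(probs).mean())

def prompt_confidence(group: list[np.ndarray]) -> float:
    """Assumed aggregation: average the confidence of all trajectories sampled for one prompt."""
    return float(np.mean([sequence_confidence(p) for p in group]))

# A peaked distribution is scored as more confident than a uniform one.
peaked = np.array([[0.97, 0.01, 0.01, 0.01]] * 3)
uniform = np.full((5, 4), 0.25)
print(sequence_confidence(peaked) > sequence_confidence(uniform))
```

Scores are always non-positive under this definition, with 0 reached only when every token distribution is deterministic.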

Curriculum Filter

Select prompts for the current update step based on dynamic quantile thresholding

Model or implementation: Algorithmic filter
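
The quantile filter can be sketched as follows, assuming the threshold is recomputed from the confidence scores of the current batch. The function name `curriculum_mask` is hypothetical; only the (1 - beta_t) quantile rule comes from the summary.

```python
import numpy as np

def curriculum_mask(confidence: np.ndarray, beta_t: float) -> tuple[np.ndarray, float]:
    """Keep roughly the top beta_t fraction of prompts, ranked by confidence.

    Returns a boolean mask over prompts and the threshold tau_t that was used.
    """
    tau_t = float(np.quantile(confidence, 1.0 - beta_t))  # (1 - beta_t) quantile
    mask = confidence >= tau_t
    return mask, tau_t

conf = np.array([-2.0, -0.1, -1.0, -0.5, -1.5, -0.3, -0.8, -1.2, -0.05, -0.6])
early_mask, early_tau = curriculum_mask(conf, beta_t=0.3)  # early training: few prompts kept
late_mask, late_tau = curriculum_mask(conf, beta_t=1.0)    # end of annealing: all prompts kept
print(int(early_mask.sum()), int(late_mask.sum()))
```

With ties at the threshold, slightly more than a beta_t fraction may be retained; a rank-based selection would give an exact count if that matters.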

Weighted GRPO Updater

Update policy parameters using importance-weighted surrogate loss

Model or implementation: Gradient Descent Optimizer
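
A rough sketch of the two ingredients of this update, under stated assumptions: advantages are centered by the group mean and scaled by the group standard deviation (a common GRPO convention), and the curriculum weight is a 0/1 mask divided by beta_t. The reward signal used in the verifier-free setting is not specified in this summary, so `rewards` below is a placeholder, and both function names are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: (num_prompts, group_size). Center by group mean, scale by group std."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def weighted_batch_loss(per_prompt_loss: np.ndarray, mask: np.ndarray, beta_t: float) -> float:
    """Batch estimate of E[(1/beta_t) * w_t(x) * L_GRPO(theta; x)] with w_t as a 0/1 mask."""
    # The weights are constants with respect to theta (stop-gradient in a real framework).
    weights = mask.astype(float) / beta_t
    return float((weights * per_prompt_loss).mean())

rewards = np.array([[1.0, 0.0, 1.0, 0.0],
                    [0.5, 0.5, 0.5, 0.5]])
adv = group_relative_advantages(rewards)

losses = np.array([2.0, 4.0, 6.0, 8.0])
mask = np.array([True, False, True, False])
filtered = weighted_batch_loss(losses, mask, beta_t=0.5)
unfiltered = weighted_batch_loss(losses, np.ones(4, dtype=bool), beta_t=1.0)
print(filtered, unfiltered)
```

When exactly a beta_t fraction of the batch is retained, the 1/beta_t factor turns the masked batch mean into the mean over the retained prompts, so the loss scale stays comparable as the curriculum widens.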

Novel Architectural Elements

Dynamic curriculum mechanism integrated into the RL loop that filters batches based on intrinsic entropy rather than external error
Weighted surrogate objective that mathematically guarantees asymptotic unbiasedness while reducing variance

Modeling

Base Model: Large Language Model (specific size/architecture not detailed in text)

Training Method: Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL)

Objective Functions:

Purpose: Maximize expected reward while staying close to reference policy, restricted to high-confidence samples.

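Formally: L_t(theta) = E[ (1/beta_t) * w_t(x) * L_GRPO(theta; x) ], where the expectation is over prompts x ~ p(x), w_t(x) is the curriculum weight for prompt x at step t (treated as a constant under stop-gradient), and beta_t is the target retention rate.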
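
To see why the 1/beta_t factor is there, the objective can be expanded under one assumption that this summary does not state explicitly: that w_t(x) is the indicator of the confidence score c(x) clearing the threshold tau_t. The symbol c(x) is introduced here for illustration.

```latex
\begin{align*}
\mathcal{L}_t(\theta)
  &= \mathbb{E}_{x \sim p(x)}\!\left[\frac{1}{\beta_t}\,
     \mathbb{1}\{c(x) \ge \tau_t\}\,\mathcal{L}_{\mathrm{GRPO}}(\theta; x)\right] \\
  &= \frac{\Pr[c(x) \ge \tau_t]}{\beta_t}\;
     \mathbb{E}\!\left[\mathcal{L}_{\mathrm{GRPO}}(\theta; x)\,\middle|\,c(x) \ge \tau_t\right] \\
  &= \mathbb{E}\!\left[\mathcal{L}_{\mathrm{GRPO}}(\theta; x)\,\middle|\,c(x) \ge \tau_t\right]
     \quad\text{when } \Pr[c(x) \ge \tau_t] = \beta_t .
\end{align*}
```

Under that assumption the surrogate is the GRPO loss averaged over the retained prompts only, which is biased toward easy prompts while beta_t < 1. As beta_t reaches 1 the indicator is 1 for every prompt and the expression reduces to the unfiltered expectation of L_GRPO, consistent with the asymptotic unbiasedness claim.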

Key Hyperparameters:

beta_t: Target retention rate, linearly annealed from a small initial value to 1
tau_t: Confidence threshold, set dynamically to the (1 - beta_t) quantile of the confidence scores so that roughly a beta_t fraction of prompts is retained
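
A linear annealing schedule for beta_t might look like the sketch below. The initial value `beta_0 = 0.2` and the annealing horizon `total_steps` are hypothetical placeholders; the summary only states that the schedule is linear and ends at 1.

```python
def retention_rate(step: int, total_steps: int, beta_0: float = 0.2) -> float:
    """Linearly anneal beta_t from beta_0 at step 0 to 1.0 at total_steps, then hold at 1.0."""
    if not 0.0 < beta_0 <= 1.0:
        raise ValueError("beta_0 must lie in (0, 1]")
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return beta_0 + (1.0 - beta_0) * progress

schedule = [retention_rate(s, total_steps=100) for s in (0, 50, 100, 500)]
print(schedule)
```

Clamping after `total_steps` keeps every prompt in the batch for the remainder of training, which is when the surrogate objective coincides with the unfiltered one.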

Compute: Not reported in the paper

Comparison to Prior Work

vs. AdaRFT/VCRL: VI-CuRL does not require external verifier signals to construct the curriculum, making it applicable where ground truth is absent
vs. Standard GRPO: Introduces a bias-variance trade-off via curriculum to prevent the high-variance collapse typical of verifier-free GRPO
vs. Self-Correction methods [not cited in paper]: VI-CuRL focuses on training stability via data selection rather than training a separate correction model or prompting for self-correction

Limitations

Relies on the assumption that intrinsic confidence (low entropy) correlates with correctness, which may not hold when the model is confidently wrong (e.g., hallucinations)
Theoretical variance bounds depend on specific assumptions about gradient norms and confidence decay
Linear annealing schedule for retention rate is a heuristic; optimal schedules might vary by task

Reproducibility

The paper provides a detailed algorithm (Algorithm 1) and theoretical proofs in the appendix. Specific hyperparameters (learning rates, batch sizes) and model checkpoints are not detailed in the main text provided.

📊 Experiments & Results

Evaluation Setup

Logic and mathematical reasoning tasks

Benchmarks:

Six logic and math reasoning benchmarks

Metrics:

Success Rate / Accuracy
Training Stability (avoidance of collapse)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

VI-CuRL consistently outperforms verifier-independent baselines across all tested benchmarks.
The method effectively prevents the training collapse observed in standard GRPO when no verifier is used.
Performance is competitive with oracle-verified methods, suggesting latent capabilities can be unlocked without constant external supervision.
Variance decomposition confirms that the curriculum successfully reduces action and problem variance, the primary drivers of instability.
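
The two variance terms named above match the standard law of total variance applied to a per-sample gradient estimate g(x, y), with prompt x and sampled trajectory y. The following is that textbook identity, written for a scalar component of g; the paper's own decomposition may use different notation or include further terms.

```latex
\[
\operatorname{Var}_{x,y}\!\big[g(x,y)\big]
  = \underbrace{\mathbb{E}_{x}\!\Big[\operatorname{Var}_{y \mid x}\!\big[g(x,y)\big]\Big]}_{\text{action variance}}
  + \underbrace{\operatorname{Var}_{x}\!\Big[\mathbb{E}_{y \mid x}\!\big[g(x,y)\big]\Big]}_{\text{problem variance}}
\]
```

Restricting training to high-confidence prompts plausibly shrinks the first term, because low-entropy policies produce less varied trajectories, and the second, because the retained prompts are more homogeneous in difficulty.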

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO framework)
Curriculum Learning concepts
Variance decomposition in statistical estimation
Language Model entropy and confidence metrics

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—enhancing reasoning by training models to maximize rewards checked by a verifier (e.g., code execution)

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same prompt against their group mean

verifier-independent: Training settings where no ground-truth checker (like a math solver or unit test) is available to score the model's outputs

intrinsic confidence: A metric derived from the model itself (specifically, length-normalized negative entropy) indicating how certain it is about its generation

curriculum learning: Training strategy starting with easy examples and gradually increasing difficulty

action variance: Variance in the gradient estimator arising from the stochasticity of the policy's actions (sampling different tokens)

problem variance: Variance in the gradient estimator arising from the diversity of prompts (different difficulty levels)

bias-variance trade-off: The balance between introducing a systematic error (bias) to reduce random noise (variance) in estimation; VI-CuRL accepts bias early on to lower variance

SFT: Supervised Fine-Tuning—initial training phase on labeled data before RL

KL regularization: Kullback-Leibler divergence penalty used to keep the RL policy from drifting too far from the reference model

stop-gradient: Operation preventing backpropagation through specific variables; used here for the curriculum weights and masks