GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to estimate advantages without a value network.
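The group-relative advantage can be sketched in a few lines; this is a minimal illustration (function name and epsilon are mine, not from any specific implementation), assuming the standard GRPO normalization of rewards within one group of rollouts:

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    GRPO replaces a learned value network with this group baseline:
    advantage_i = (r_i - mean(group)) / (std(group) + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, scored by a verifiable reward (1 = correct):
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

Rollouts above the group mean get positive advantages and are reinforced; those below are penalized, all without training a critic.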
RLVR: Reinforcement Learning with Verifiable Rewards—a setting where the environment provides a ground-truth signal (e.g., correct math answer) to score generations.
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer.
Rollout: A single complete generation (trajectory) produced by the policy model given a prompt.
Replay Learning: A technique where past high-value training samples are stored in a buffer and re-sampled later to reinforce learning.
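A buffer like the one described above can be sketched as follows; the class name, capacity, and reward threshold are illustrative assumptions, not a reference to any particular trainer:

```python
import random

class ReplayBuffer:
    """Minimal sketch of a replay buffer for high-value training samples."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = []

    def add(self, sample, reward, threshold=0.5):
        # Store only high-value samples; evict the oldest when full.
        if reward >= threshold:
            if len(self.items) >= self.capacity:
                self.items.pop(0)
            self.items.append(sample)

    def draw(self, k):
        # Re-sample stored trajectories to mix into later training batches.
        return random.sample(self.items, min(k, len(self.items)))
```

Re-mixing stored successes this way reinforces behaviors that verifiable rewards have already confirmed, at the cost of some off-policy drift.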
GSPO: Group Sequence Policy Optimization—a variant of GRPO that computes importance ratios at the sequence level rather than per token.
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—a variant of GRPO that decouples the lower and upper clipping ranges and dynamically resamples prompts to avoid groups whose rollouts are all correct or all incorrect.
Test-Time Scaling: The phenomenon where generating more tokens or samples at inference time (e.g., longer reasoning chains) leads to better performance.