Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) | Synthetic Data Generation | Self-play

SvS is an online self-play strategy where the policy synthesizes difficult variational problems from its own correct solutions, maintaining training entropy and improving Pass@k performance where standard RLVR fails.

Core Problem

Standard RLVR training improves Pass@1 at the expense of policy entropy (diversity), leading to 'mode collapse' where the model memorizes solutions and stops exploring, causing Pass@k performance to plateau.

Why it matters:

Pass@k represents the upper bound of an LLM's reasoning capability; failing to improve it limits the model's potential to solve harder problems.
Current methods for gathering training data (human annotation or external synthesis) lack the precise ground-truth answers required for verifiable reward training.
Existing RLVR strategies narrow reasoning trajectories toward the most reward-prone solutions, reducing exploration capacity.

Concrete Example: When trained on a limited math set, a standard RLVR policy quickly learns one specific solution path for a problem and repeats it to maximize reward (a form of reward hacking). Its entropy then collapses toward zero, and it fails to solve variations of that problem or harder problems that require novel reasoning steps.

Key Novelty

Self-play with Variational Problem Synthesis (SvS)

Uses the policy's own correct solutions to challenging problems as context to generate 'variational problems' (rephrased/restructured versions) that share the exact same answer.
Bypasses the need for external answer labeling because the synthetic problems are derived inversely from correct solutions to known problems.
Introduces a self-improving loop where the policy solves original problems, creates new variations from successes, and then solves those variations, keeping training data fresh and diverse.

Architecture

The data workflow loop in a single training iteration of SvS.

Evaluation Highlights

+18.3 and +22.8 percentage-point absolute gains on Pass@32 for AIME 24 and AIME 25, respectively, using Qwen2.5-32B-Instruct compared to standard RLVR.
Achieves an average absolute improvement of about 3 percentage points over standard RLVR baselines across 12 reasoning benchmarks for models ranging from 3B to 32B parameters.
Maintains stable policy entropy throughout training, whereas standard RLVR shows a continuous decline (entropy collapse).

Breakthrough Assessment

9/10

Addresses a critical bottleneck in RLVR (entropy collapse/Pass@k plateau) with a self-contained solution requiring no external supervision. The gains on competition math (AIME) are substantial.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning tasks

Inputs: A dataset of reasoning problems D = {(x, a)} where x is the problem and a is the ground truth answer

Outputs: A policy πθ that generates reasoning steps y ending in the correct answer a

Pipeline Flow

Original Problem Solving: Policy generates solutions for batch; filter for underperforming problems
Variational Synthesis: Policy uses correct solutions to synthesize new variational problems
Synthetic Solving: Policy attempts to solve the new variational problems
Policy Update: Update policy using original solutions, synthetic problems, and synthetic solutions
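
The four steps above can be sketched as a single data-collection function. This is a minimal illustration, not the authors' implementation: `Sample`, `svs_iteration`, the `solve` and `synthesize` callables, and the numeric thresholds (`hard_max_acc`, `low`, `high`) are all hypothetical names and placeholder values, since the summary does not report them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    prompt: str
    response: str
    reward: float


def svs_iteration(
    batch: List[Tuple[str, str]],             # (problem, ground-truth answer)
    solve: Callable[[str], Tuple[str, str]],  # problem -> (solution text, final answer)
    synthesize: Callable[[str], str],         # correct solution -> variational problem
    group_size: int = 4,
    hard_max_acc: float = 0.5,  # assumed cutoff for an "underperforming" problem
    low: float = 0.1,           # assumed lower accuracy threshold
    high: float = 0.8,          # assumed upper accuracy threshold
) -> List[Sample]:
    """Collect the training samples of one SvS-style iteration."""
    buffer: List[Sample] = []
    for problem, answer in batch:
        # 1) Original problem solving: sample a group and grade it.
        group = [solve(problem) for _ in range(group_size)]
        buffer += [Sample(problem, sol, float(ans == answer)) for sol, ans in group]
        correct = [sol for sol, ans in group if ans == answer]
        acc = len(correct) / group_size
        if not correct or acc > hard_max_acc:
            continue  # no correct solution to build on, or not underperforming

        for sol in correct:
            # 2) Variational synthesis from a correct solution.
            new_problem = synthesize(sol)
            # 3) Synthetic solving, graded against the ORIGINAL answer.
            new_group = [solve(new_problem) for _ in range(group_size)]
            new_acc = sum(a == answer for _, a in new_group) / group_size
            # The synthesis step is rewarded only if the new problem is
            # challenging but solvable.
            buffer.append(Sample(sol, new_problem, float(low <= new_acc <= high)))
            buffer += [Sample(new_problem, s, float(a == answer)) for s, a in new_group]
    # 4) The returned buffer would feed the policy update.
    return buffer
```

In this sketch the same policy would sit behind both `solve` and `synthesize`, which matches the shared-model design described in the summary; the update step itself is left out.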

System Modules

Policy Model (Solver) (Reasoning & Generation)

Solves problems and generates reasoning traces

Model or implementation: Qwen2.5 (3B, 32B) or LLaMA-3.1 (8B)

Policy Model (Synthesizer) (Reasoning & Generation)

Generates variational problems based on correct solutions

Model or implementation: Same shared policy model

Reward & Filter Mechanism

Calculates rewards and filters data for updates

Model or implementation: Rule-based

Novel Architectural Elements

Joint optimization loop where the policy is simultaneously the problem solver and the problem proposer (synthesizer) within the same RL step
Inverse-mapping verification: Validating synthetic problems by checking if the policy's solutions to them match the *original* problem's answer
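
The inverse-mapping check above can be illustrated with a small grading helper. This is a sketch under stated assumptions: the `\boxed{...}` answer convention, the exact string match, and the names `extract_boxed` and `variational_accuracy` are illustrative choices, not details confirmed by the summary.

```python
import re
from typing import List, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last boxed answer in text (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def variational_accuracy(solutions: List[str], original_answer: str) -> float:
    """Fraction of solutions to a synthetic problem whose final answer
    matches the answer of the ORIGINAL problem it was derived from."""
    if not solutions:
        return 0.0
    target = original_answer.strip()
    hits = sum(extract_boxed(s) == target for s in solutions)
    return hits / len(solutions)
```

Because the synthetic problem is graded against the original answer, no external labeler is needed; a real rule-based verifier would likely normalize answers more carefully than this exact-match comparison.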

Modeling

Base Model: Qwen2.5-32B-Instruct, Qwen2.5-3B-Instruct, LLaMA-3.1-8B-Instruct

Training Method: GRPO (Group Relative Policy Optimization)

Objective Functions:

Purpose: Optimize policy to maximize expected reward.

Formally: Standard GRPO objective taking the average of advantages over a group of outputs.

Purpose: Validate variational problems.

Formally: Reward R_v(x_hat) = 1 if avg_accuracy is between thresholds (challenging but solvable), else 0.
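
Both objectives can be made concrete in a few lines. The sketch below assumes the usual GRPO group normalization (reward minus group mean, divided by group standard deviation) and uses placeholder thresholds `low` and `high`, because the summary does not give the actual values; the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of outputs for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def variational_reward(avg_accuracy: float, low: float = 0.1, high: float = 0.8) -> float:
    """R_v(x_hat): 1 if the synthetic problem is challenging but solvable, else 0."""
    return 1.0 if low <= avg_accuracy <= high else 0.0
```

One consequence of the group normalization is that a group with identical rewards (all correct or all wrong) yields zero advantages and therefore no learning signal, which is consistent with rewarding only synthetic problems whose accuracy falls between the two thresholds.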

Adaptation: Full fine-tuning

Trainable Parameters: All parameters

Training Data:

MATH-12k (12,000 problems)
DAPO-17k (17,000 competition-level problems)
DeepMath (8k open-ended problems, optional)

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
beta_kl: Not explicitly reported in the paper
group_size_G: Not explicitly reported in the paper
max_response_tokens: 24k (implied from Figure 6 caption context)

Compute: 32 H100 GPUs for 32B model experiments

Comparison to Prior Work

vs. Standard RLVR: SvS updates data online with self-synthesized variations, maintaining entropy.
vs. Offline Rephrasing: SvS ensures answer consistency by deriving problems from correct solutions and validating via self-play (inverse mapping).
vs. Generative Teachers [not cited in paper]: SvS does not require a stronger teacher model; it is fully self-improving.
vs. RFT (Rejection Sampling Fine-Tuning) [not cited in paper]: SvS is an online RL method rather than offline SFT on positive samples.

Limitations

Computational cost is higher than standard RLVR due to the synthesis and additional solving steps (though comparable on simpler datasets like MATH-12k).
Performance on open-ended answer tasks may degrade if the augmentation overfits to specific formats (e.g., integer-only).
Requires the base model to have some initial capability to solve problems to generate the 'seed' solutions for synthesis.

Reproducibility

Code: https://github.com/MasterVito/SvS

Code and model weights are public. Hyperparameters like learning rate and KL coefficient are missing from the text. Prompt templates for variational synthesis are referenced (Figure 20 in Appendix).

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning and code generation tasks evaluated via exact match of final answers.

Benchmarks:

AIME 2024 (Competition Math)
AIME 2025 (Competition Math)
MATH-500 (Math Reasoning)
Beyond-AIME (Competition Math) [New]
GSM8K (Grade School Math)
CodeContests (Competitive Programming)

Metrics:

Pass@1
Pass@32

Statistical methodology: Average of 32 inferences used for AIME-level benchmarks on smaller models to mitigate randomness.

Key Results

Competition-level math results (AIME) showing massive gains in Pass@32 for the 32B model trained on DAPO-17k.

Benchmark	Metric	Baseline	This Paper	Δ
AIME 24	Pass@32	52.5	70.8	+18.3
AIME 25	Pass@32	42.4	65.2	+22.8

Pass@1 results across diverse benchmarks for the 32B model trained on MATH-12k.

Benchmark	Metric	Baseline	This Paper	Δ
MATH-500	Pass@1	86.4	87.2	+0.8
AIME 24	Pass@1	26.7	30.0	+3.3
Beyond-AIME	Pass@1	54.8	57.3	+2.5

Code generation efficiency results using Qwen2.5-7B-Instruct.

Benchmark	Metric	Baseline	This Paper	Δ
CodeContests	Pass@1 (Avg@16)	32.0	48.0	+16.0

Experiment Figures

Policy entropy and Pass@k trajectories during training with different data strategies.

Scaling of Pass@k performance (log scale samples) on AIME and MATH-500.

Main Takeaways

SvS prevents entropy collapse: Unlike RLVR, where entropy steadily declines, SvS maintains stable entropy due to the continuous introduction of fresh variational problems.
Scalable Pass@k: SvS outperforms RLVR significantly as k increases (scaling test up to k=1024), indicating it preserves diverse reasoning paths.
Data Efficiency: On code generation, SvS achieves superior performance in ~100 steps compared to RLVR's >600 steps.
Ablation confirms necessity of complexity: Augmenting with 'simpler' problems (high solve rate) fails to improve Pass@32; targeting 'underperforming' problems is key.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Language Model Post-training
Rejection Sampling / Pass@k metrics

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using outcomes (correct/incorrect) as the primary reward signal

Pass@k: A metric evaluating the probability that at least one correct solution is generated out of k independent samples

Pass@1: The accuracy of the model when generating a single solution

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input to reduce variance without a value network

Policy Entropy: A measure of the randomness/diversity in the model's token predictions; low entropy indicates the model is confident but repetitive (collapsed)

Mode Collapse: A failure mode where the model converges to a limited set of outputs, losing diversity and exploration capability

Variational Problems: Synthetically generated problems that differ in wording or structure from an original problem but preserve the underlying logic and final answer

Self-play: A training paradigm where the model generates its own training data (problems and solutions) and learns from it

Reward Shaping: Modifying the raw reward signal (e.g., correct/incorrect) to guide the learning process more effectively (e.g., penalizing trivial synthetic problems)
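
The Pass@k and Policy Entropy definitions above can be made concrete with two small functions. The Pass@k function uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) introduced with the HumanEval benchmark; the summary does not say which estimator the paper uses, so treat this as one common choice rather than the paper's method. The function names are illustrative.

```python
import math
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For example, with 4 samples and 1 correct, `pass_at_k(4, 1, 1)` gives 0.25 while `pass_at_k(4, 1, 4)` gives 1.0. A distribution that puts all mass on one token has entropy 0, which is the collapsed regime described in the glossary.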