
$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

Yi-Kai Zhang, Yueqing Sun, Hongyan Hao, Qi Gu, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye
School of Artificial Intelligence, Nanjing University; Meituan, China; National Key Laboratory for Novel Software Technology, Nanjing University
arXiv (2026)
RL Reasoning

📝 Paper Summary

Topics: Reinforcement Learning with Verifiable Rewards (RLVR) · Value Estimation / Baseline Construction · Efficient RL Post-training for LLMs
V0.5 stabilizes sparse RL training by fusing an unbiased but noisy empirical mean with a stable but potentially biased generalist value prior, dynamically allocating rollouts only when the two conflict.
Core Problem
In RLVR, estimating baselines via sparse rollouts causes high variance that destabilizes training, while parameterized value models require expensive synchronous training and suffer from distribution shifts.
Why it matters:
  • High variance in baseline estimation leads to unstable policy gradients, hindering the optimization of complex reasoning tasks
  • Traditional value models (critics) introduce a 'coupling dilemma,' requiring massive compute to retrain the critic alongside the policy
  • Sparse sampling is necessary for long-horizon tasks due to computational costs, but it inherently lacks statistical precision
Concrete Example: When a policy generates only 4 responses (sparse rollouts) for a math problem, the empirical mean reward fluctuates wildly. A standard critic might hallucinate a value, biasing the update. V0.5 uses the critic's estimate as a prior but rejects it when the 4 rollouts deviate from it by more than sampling noise can explain.
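The fusion described above can be sketched as an inverse-variance shrinkage estimator with a deviation test. This is a minimal illustration, not the paper's implementation: the function name `fused_baseline` and the assumption of known, fixed variances `sigma2` (per-rollout noise) and `tau2` (prior uncertainty) are ours.

```python
import math

def fused_baseline(rollout_rewards, v0_prior, sigma2, tau2, z_crit=1.96):
    """Fuse a frozen value-model prior with the empirical mean of sparse
    rollouts via an MSE-minimizing shrinkage weight (illustrative sketch).

    rollout_rewards: rewards of the n sampled responses
    v0_prior:        frozen generalist value model's estimate for this prompt
    sigma2:          assumed per-rollout reward variance
    tau2:            assumed variance (uncertainty) of the prior
    """
    n = len(rollout_rewards)
    emp_mean = sum(rollout_rewards) / n
    emp_var = sigma2 / n  # variance of the empirical mean

    # Deviation test: if the prior disagrees with the rollouts beyond what
    # sampling noise explains, treat it as a hallucination and fall back
    # to the unbiased empirical mean.
    if abs(v0_prior - emp_mean) > z_crit * math.sqrt(emp_var + tau2):
        return emp_mean

    # Precision-weighted (shrinkage) average: when the prior is consistent
    # with the rollouts, mixing it in reduces the baseline's variance.
    w_prior = emp_var / (emp_var + tau2)
    return w_prior * v0_prior + (1.0 - w_prior) * emp_mean
```

With 4 rollouts of mean 0.75 and a consistent prior of 0.7, the fused baseline lands between the two; an implausible prior (e.g. 0.0 against four rewards of 1.0) fails the test and the empirical mean is returned unchanged.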
Key Novelty
Adaptive Prior-Empirical Fusion with Sequential Budgeting
  • Treats a frozen Generalist Value Model (V0) as a statistical prior, fusing it with the empirical mean of rollouts via a shrinkage estimator to minimize Mean Squared Error (MSE)
  • Implements a 'deviation test' equivalent to a hypothesis test: if the prior aligns with rollouts, it reduces variance; if it conflicts (hallucination), the system reverts to the empirical mean
  • Uses One-Step-Look-Ahead (OSLA) sequential analysis to dynamically decide whether to stop sampling or request more rollouts based on real-time uncertainty
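The OSLA budgeting idea in the last bullet can be sketched as a stopping rule: keep requesting rollouts while one more sample would reduce the fused estimator's MSE by more than its compute cost. This is a toy sketch under assumed known variances; the function name `allocate_rollouts` and the scalar `cost` are ours, not the paper's.

```python
def allocate_rollouts(sigma2, tau2, cost, n_min=2, n_max=16):
    """One-step-look-ahead (OSLA) budgeting sketch: sample while the
    marginal MSE reduction of one extra rollout exceeds its cost.

    sigma2: assumed per-rollout reward variance
    tau2:   assumed prior variance; a trusted prior (small tau2) stops early
    cost:   assumed compute cost per additional rollout (same units as MSE)
    """
    def fused_mse(n):
        # MSE of the precision-weighted fusion of the prior (variance tau2)
        # with the empirical mean over n rollouts (variance sigma2 / n).
        return 1.0 / (n / sigma2 + 1.0 / tau2)

    n = n_min
    while n < n_max and (fused_mse(n) - fused_mse(n + 1)) > cost:
        n += 1
    return n
```

The rule captures the adaptive behavior described above: when the prior is uncertain (large `tau2`), extra rollouts pay for themselves and the budget grows; when the prior is trusted, sampling stops at the minimum group size.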
Evaluation Highlights
  • Achieves >10% performance improvement over GRPO and DAPO across six mathematical reasoning benchmarks
  • Guarantees stable policy gradients even with extreme sparsity (group size of 4), where standard empirical baselines fail
  • Decomposes baseline MSE into orthogonal bias and variance terms, showing that reducing it linearly suppresses overall policy gradient variance
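The decomposition behind the last bullet is the standard bias-variance split, written here with assumed symbols ($b$ the baseline, $V^{\pi}$ the true value, $\hat{g}$ the gradient estimate); the paper's exact constants are not reproduced:

```latex
\mathrm{MSE}(b) \;=\; \mathbb{E}\!\left[(b - V^{\pi})^2\right]
\;=\; \underbrace{\mathrm{Var}(b)}_{\text{rollout noise}}
\;+\; \underbrace{\mathrm{Bias}(b)^2}_{\text{prior error}},
\qquad
\mathrm{Var}(\hat{g}) \;\propto\; \mathrm{MSE}(b) + \text{const}.
```

Because gradient variance scales linearly with the baseline's MSE, any MSE reduction from fusing the prior transfers directly into more stable policy updates.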
Breakthrough Assessment
9/10
Elegantly solves the critic coupling problem by turning value estimation into a statistical inference task. The theoretical MSE decomposition and dynamic budgeting offer a rigorous, compute-efficient alternative to standard PPO/GRPO.