
How Far Can Unsupervised RLVR Scale LLM Training?

Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Yuchen Zhang, Bowen Zhou, Zhiyuan Liu, Ning Ding
Tsinghua University, Shanghai AI Lab, Xi'an Jiaotong University, University of Illinois Urbana-Champaign, Frontis.AI, Shanghai Jiao Tong University, Peking University
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Unsupervised Reinforcement Learning with Verifiable Rewards (RLVR) · Post-training scaling
Intrinsic unsupervised RLVR methods fundamentally rely on sharpening the model's initial distribution, which works only when initial confidence aligns with correctness and inevitably collapses otherwise.
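A minimal numerical sketch of what "sharpening" means here (illustrative only; the low-temperature analogy and the numbers are assumptions, not the paper's formulation): probability mass flows toward whatever the model already prefers, whether or not it is correct.

```python
import math

def sharpen(probs, temperature=0.5):
    """Renormalize a distribution at temperature < 1, a simple stand-in for the
    sharpening effect of intrinsic-reward RL: mass flows toward the answer the
    model already favors; no new knowledge is introduced."""
    logits = [math.log(p) / temperature for p in probs]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

# A model 60% confident in an incorrect answer becomes ~78% confident after one
# round of sharpening -- amplification, not discovery.
print(sharpen([0.6, 0.3, 0.1]))  # -> [~0.78, ~0.20, ~0.02]
```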
Core Problem
Supervised RLVR relies on expensive ground-truth labels that are hard to scale, while current unsupervised alternatives built on intrinsic rewards (e.g., self-consistency) suffer from poorly understood failure modes such as model collapse.
Why it matters:
  • Scaling supervision requires prohibitive human costs as models surpass human expertise
  • Current unsupervised methods report inconsistent gains without a unified understanding of their mechanisms or limitations
  • Reward hacking and model collapse prevent intrinsic rewards from serving as a robust long-term scaling solution
Concrete Example: In math problem solving, a model may assign high probability to an incorrect answer. Intrinsic methods like majority voting reinforce that answer because they reward consistency rather than correctness, leaving the model confidently wrong.
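To make this failure mode concrete, here is a minimal sketch of a majority-vote (self-consistency) intrinsic reward; the helper `majority_vote_rewards` is hypothetical and not the paper's exact reward, but it shows how the signal never consults ground truth, so a confidently wrong majority gets reinforced.

```python
from collections import Counter

def majority_vote_rewards(sampled_answers):
    """Reward each sampled answer 1.0 if it matches the most common answer in
    the group, else 0.0 -- consistency is rewarded; correctness never enters."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if ans == majority_answer else 0.0 for ans in sampled_answers]

# Suppose the true answer is "17" but the model samples "42" three times out of
# four: the wrong answer wins the vote and is the one that gets reinforced.
print(majority_vote_rewards(["42", "42", "42", "17"]))  # -> [1.0, 1.0, 1.0, 0.0]
```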
Key Novelty
Sharpening Mechanism Theory & Model Collapse Step
  • A unified theoretical framework showing that all intrinsic rewards (voting, entropy, etc.) drive convergence by sharpening the model's initial distribution, amplifying existing preferences rather than discovering new knowledge
  • Identification of a universal 'rise-then-fall' training pattern: early gains come from sharpening confidently correct answers, and collapse follows once confidently wrong ones are amplified
  • Proposal of the 'Model Collapse Step' as a metric that measures model priors and predicts RL trainability without expensive full training runs (see the sketch after this list)
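One way the Model Collapse Step could be operationalized is sketched below; the function name, the patience heuristic, and the accuracy proxy are assumptions for illustration, not the paper's exact definition. The idea is simply to locate the turning point of the rise-then-fall curve.

```python
def model_collapse_step(accuracy_by_step, patience=3):
    """Given held-out accuracy measured at successive RL steps, return the peak
    step of the 'rise-then-fall' pattern, i.e. the last step before accuracy
    declines for `patience` consecutive evaluations."""
    steps = sorted(accuracy_by_step)
    peak_step, declines = steps[0], 0
    for prev, curr in zip(steps, steps[1:]):
        if accuracy_by_step[curr] >= accuracy_by_step[prev]:
            declines = 0
            if accuracy_by_step[curr] >= accuracy_by_step[peak_step]:
                peak_step = curr
        else:
            declines += 1
            if declines >= patience:
                return peak_step
    return steps[-1]  # no sustained collapse observed within this run

# Toy curve: gains up to step 400, then a steady decline -> collapse step 400.
curve = {0: 0.30, 200: 0.35, 400: 0.38, 600: 0.37, 800: 0.33, 1000: 0.28}
print(model_collapse_step(curve))  # -> 400
```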
Evaluation Highlights
  • Intrinsic rewards match supervised RL gains early in training (e.g., on AIME 2024) but inevitably collapse after ~1000 steps due to reward hacking
  • Small datasets (≤128 samples) prevent model collapse, enabling safe deployment for Test-Time Training
  • Model Collapse Step correlates strongly with Ground Truth Gain, making it a better predictor of RL trainability than Pass@k
Breakthrough Assessment
8/10
Provides a crucial reality check on the hype around unsupervised RL by theoretically proving the limits of intrinsic rewards, while offering practical metrics and identifying safe operating regimes such as test-time training.