The Invisible Leash: Why RLVR May or May Not Escape Its Origin

Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi
arXiv (2025)

📝 Paper Summary

RLVR primarily acts as a support-constrained optimizer that improves precision by concentrating probability on known solutions rather than expanding the model's reasoning capabilities to discover genuinely new solution paths.
Core Problem
It is unclear whether Reinforcement Learning with Verifiable Rewards (RLVR) genuinely expands a model's reasoning boundaries or merely amplifies high-reward outputs the base model already knows.
Why it matters:
  • If RLVR only reinforces existing patterns, it cannot unlock advanced reasoning capabilities for models that lack them initially (e.g., GPT-2)
  • Standard metrics like pass@1 may mask the loss of solution diversity, leading to models that fail on underrepresented correct solutions
  • Understanding this limitation is crucial for designing hybrid strategies that can actually seed probability mass into new, correct solution regions
Concrete Example: On the AIME 2024 benchmark, the ProRL-1.5B RLVR model may achieve higher precision on a single attempt, but the base model reaches a higher pass@8192 score (93.3% vs. 83.3% for the RLVR model). The RLVR model has 'shrunk' its support, losing access to valid solution paths that the base model can still find given enough samples.
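The pass@k numbers above are typically computed with the standard unbiased estimator (from the Codex evaluation literature), which estimates the chance that at least one of k samples is correct given n total attempts with c successes. A minimal sketch:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a numerically stable running product.
    n = total samples drawn, c = correct samples, k = budget."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots: success guaranteed
    prob_all_wrong = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_wrong *= 1.0 - k / i
    return 1.0 - prob_all_wrong
```

For example, `pass_at_k(4, 2, 2)` gives 5/6: with 2 correct out of 4 samples, only one of the six possible pairs is entirely wrong.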
Key Novelty
Empirical Support Analysis Framework
  • Introduces the concept of 'empirical support'—the set of correct solutions a model can realistically discover under finite sampling thresholds
  • Defines four distinct solution categories: Preservation (kept), Expansion (newly found), Shrinkage (lost), and Out-of-support
  • Identifies the 'Invisible Leash' phenomenon where RLVR increases local token-level entropy (uncertainty) while paradoxically decreasing global answer-level entropy (diversity), effectively narrowing the solution space
Evaluation Highlights
  • RLVR results in net support shrinkage across benchmarks, losing ~3.6x more solutions than it gains (ProRL-1.5B-v2 loses 175 completions while gaining only 48)
  • Base models outperform RLVR models at large sampling budgets on AIME 2024 (Base pass@8192: 93.3% vs. ProRL-1.5B: 83.3%)
  • Support Retention Rate (SRR) is extremely high (0.93–0.99) across 1.5B–14B models, confirming RLVR mostly preserves known solutions rather than discovering new ones
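A plausible reading of the SRR metric, consistent with the 0.93-0.99 range reported: the fraction of the base model's empirical support that the RLVR model retains (the exact definition in the paper may differ):

```python
def support_retention_rate(base_solved: set, rlvr_solved: set) -> float:
    """SRR: fraction of solutions in the base model's empirical
    support that the RLVR model still finds. Values near 1.0 mean
    RLVR mostly preserves rather than discovers solutions."""
    if not base_solved:
        return 1.0  # vacuously retained
    return len(base_solved & rlvr_solved) / len(base_solved)
```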
Breakthrough Assessment
8/10
Provides a critical, empirically grounded counter-narrative to the RLVR hype. The definitions of empirical support and the metrics (SRR, NDR) offer a new rigorous lens for evaluating reasoning progress beyond simple accuracy.