The Invisible Leash: Why RLVR May or May Not Escape Its Origin

Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi
arXiv (2025)

📝 Paper Summary

RLVR primarily acts as a support-constrained optimizer that improves precision by concentrating probability on known solutions rather than expanding the model's reasoning capabilities to discover genuinely new solution paths.
Core Problem
It is unclear whether Reinforcement Learning with Verifiable Rewards (RLVR) genuinely expands a model's reasoning boundaries or merely amplifies high-reward outputs the base model already knows.
Why it matters:
  • If RLVR only reinforces existing patterns, it cannot unlock advanced reasoning capabilities for models that lack them initially (e.g., GPT-2)
  • Standard metrics like pass@1 may mask the loss of solution diversity, leading to models that fail on underrepresented correct solutions
  • Understanding this limitation is crucial for designing hybrid strategies that can actually seed probability mass into new, correct solution regions
Concrete Example: On the AIME 2024 benchmark, the ProRL-1.5B RLVR model may achieve higher precision on a single attempt, but the base model reaches a higher pass@8192 score (93.3% vs. 83.3% for the RLVR model). The RLVR model has 'shrunk' its support, losing access to valid solution paths that the base model can still find given enough samples.
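The pass@k numbers above are typically computed with the standard unbiased estimator (from the Codex evaluation literature), which estimates the chance that at least one of k samples is correct given n total attempts with c successes. A minimal sketch:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a numerically stable running product.
    n = total samples drawn, c = correct samples, k = budget."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots: success guaranteed
    prob_all_wrong = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_wrong *= 1.0 - k / i
    return 1.0 - prob_all_wrong
```

For example, `pass_at_k(4, 2, 2)` gives 5/6: with 2 correct out of 4 samples, only one of the six possible pairs is entirely wrong.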
Key Novelty
Empirical Support Analysis Framework
  • Introduces the concept of 'empirical support'—the set of correct solutions a model can realistically discover under finite sampling thresholds
  • Defines four distinct solution categories: Preservation (kept), Expansion (newly found), Shrinkage (lost), and Out-of-support
  • Identifies the 'Invisible Leash' phenomenon where RLVR increases local token-level entropy (uncertainty) while paradoxically decreasing global answer-level entropy (diversity), effectively narrowing the solution space
Evaluation Highlights
  • RLVR results in net support shrinkage across benchmarks, losing ~3.6x more solutions than it gains (ProRL-1.5B-v2 loses 175 completions while gaining only 48)
  • Base models outperform RLVR models at large sampling budgets on AIME 2024 (Base pass@8192: 93.3% vs. ProRL-1.5B: 83.3%)
  • Support Retention Rate (SRR) is extremely high (0.93–0.99) across 1.5B–14B models, confirming RLVR mostly preserves known solutions rather than discovering new ones
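A plausible reading of the SRR metric, consistent with the 0.93-0.99 range reported: the fraction of the base model's empirical support that the RLVR model retains (the exact definition in the paper may differ):

```python
def support_retention_rate(base_solved: set, rlvr_solved: set) -> float:
    """SRR: fraction of solutions in the base model's empirical
    support that the RLVR model still finds. Values near 1.0 mean
    RLVR mostly preserves rather than discovers solutions."""
    if not base_solved:
        return 1.0  # vacuously retained
    return len(base_solved & rlvr_solved) / len(base_solved)
```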
Breakthrough Assessment
8/10
Provides a critical, empirically grounded counter-narrative to the RLVR hype. The definitions of empirical support and the metrics (SRR, NDR) offer a new rigorous lens for evaluating reasoning progress beyond simple accuracy.