The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR)
Mathematical Reasoning

RLVR training paradoxically shrinks the set of solvable problems because on-policy sampling disproportionately reinforces already-high-likelihood solutions while negatively interfering with low-likelihood correct ones.

Core Problem

RLVR improves average accuracy (Pass@1) but often reduces the total number of problems a model can solve (coverage/Pass@k) by forgetting how to solve harder, low-likelihood problems.

Why it matters:

Models should expand their reasoning capabilities to new problems, not just become more confident on problems they can already solve
Decreasing Pass@k indicates a loss of diversity in reasoning strategies, limiting the model's robustness and exploration capability
Current regularization techniques like KL divergence and clipping fail to prevent this 'winner-take-all' collapse

Concrete Example: In the Minerva benchmark, Qwen2.5-Math starts with high accuracy using code-based reasoning but lower accuracy with natural language. During RLVR, it collapses entirely to natural language reasoning (the 'winner'), causing performance on code-solvable problems to degrade.

Key Novelty

SELF (Selective Examples with Low-likelihood and Forward-KL)

Analyzes RLVR dynamics to reveal 'negative interference,' where updating the model to solve easy problems actively harms its ability to solve harder ones
Identifies a 'winner-take-all' effect driven by on-policy sampling: the model reinforces the reasoning paths it already favors, ignoring valid but lower-probability paths
Proposes a data curation strategy that explicitly targets problems with low initial likelihood of correctness to counter this bias

Architecture

Conceptual illustration of the 'Winner-Take-All' phenomenon in RLVR

Evaluation Highlights

SELF improves Pass@1 performance on AIME24 compared to standard RLVR (PPO/GRPO) baselines
Standard RLVR shows a significant drop in Pass@256 (coverage) after ~300 steps, falling below the base model, while SELF mitigates this shrinkage
Strong correlation observed between negative interference metrics and the decline in Pass@k performance across four mathematical benchmarks

Breakthrough Assessment

8/10

Provides a crucial theoretical and empirical explanation for the widely observed 'coverage shrinkage' in RLVR. The proposed solution is simple yet effective, directly addressing the identified mechanism.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement learning for reasoning tasks where each problem induces a distinct MDP with binary outcomes (correct/incorrect)

Inputs: A prompt x (math problem)

Outputs: A reasoning chain and answer y

Pipeline Flow

Prompt Sampling (Math Problems)
Response Generation (Policy Rollout)
Reward Verification (Correctness Check)
Policy Update (PPO/GRPO with Data Curation)
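
The four stages above can be sketched as a single training step. This is a minimal sketch under stated assumptions, not the paper's implementation: `rlvr_step`, `rollout`, `verify`, and `update_policy` are hypothetical names, the toy stand-ins at the bottom replace a real language model and optimizer, and the data curation step is omitted.

```python
import random
from typing import Callable, List, Optional, Tuple

Experience = Tuple[str, str, float]  # (prompt, response, reward)


def rlvr_step(
    prompts: List[str],
    rollout: Callable[[str], str],                      # hypothetical: samples a response from the policy
    verify: Callable[[str, str], float],                # hypothetical: returns 1.0 (correct) or 0.0
    update_policy: Callable[[List[Experience]], None],  # hypothetical: PPO/GRPO update
    batch_size: int = 2,
    samples_per_prompt: int = 4,
    rng: Optional[random.Random] = None,
) -> List[Experience]:
    rng = rng or random.Random(0)
    # 1. Prompt sampling
    batch = rng.sample(prompts, k=min(batch_size, len(prompts)))
    experience: List[Experience] = []
    for prompt in batch:
        for _ in range(samples_per_prompt):
            response = rollout(prompt)          # 2. Response generation
            reward = verify(prompt, response)   # 3. Reward verification
            experience.append((prompt, response, reward))
    update_policy(experience)                   # 4. Policy update
    return experience


# Toy stand-ins (assumptions for illustration only).
ANSWERS = {"1+1": "2", "2+3": "5", "7*6": "42"}
toy_rng = random.Random(1)


def toy_rollout(prompt: str) -> str:
    # A "policy" that guesses among known answers at random.
    return toy_rng.choice(["2", "5", "42"])


def toy_verify(prompt: str, response: str) -> float:
    return 1.0 if response == ANSWERS[prompt] else 0.0


update_log: List[int] = []


def toy_update(experience: List[Experience]) -> None:
    # Records the batch size instead of performing a gradient step.
    update_log.append(len(experience))


experience = rlvr_step(list(ANSWERS), toy_rollout, toy_verify, toy_update)
```

Because responses are drawn from the current policy in stage 2, the batch passed to the update contains only what the policy already tends to produce, which is the on-policy bias the summary describes.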

System Modules

Base Policy

Initial language model used to generate solutions

Model or implementation: Qwen2.5-Math-1.5B/7B or Llama-3.2-3B-Instruct

Reward Verifier

Checks whether the generated answer matches the ground-truth answer

Model or implementation: Deterministic rule-based checker
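
A deterministic rule-based checker of this kind might look like the sketch below. The answer format is an assumption: the summary does not say how answers are extracted, so this example assumes the final answer appears in a `\boxed{...}` expression without nested braces and compares it to the ground truth after light normalization. The function names are hypothetical.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    r"""Return the content of the last \boxed{...} in the response, if any.

    Assumption: no nested braces inside the box.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def normalize(answer: str) -> str:
    # Illustrative normalization only: drop spaces and a trailing period.
    return answer.replace(" ", "").rstrip(".")


def verify(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if normalize(answer) == normalize(ground_truth) else 0.0
```

A production verifier would typically also handle mathematically equivalent forms (for example `0.5` versus `1/2`), which this sketch does not attempt.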

Data Curator (SELF)

Selects which problems to train on to mitigate winner-take-all effects

Model or implementation: Algorithm SELF

Novel Architectural Elements

SELF data curation mechanism: specifically filters training batches to prioritize low-likelihood correct solutions, reversing the standard on-policy bias
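
The idea of prioritizing low-likelihood problems can be illustrated with a simple filter. This is a hedged sketch, not Algorithm SELF itself: the summary does not give the selection rule, so the threshold `max_rate`, the exclusion of never-solved problems, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List


def estimate_success_rate(rewards: List[float]) -> float:
    """Fraction of sampled responses that were verified correct."""
    return sum(rewards) / len(rewards) if rewards else 0.0


def select_low_likelihood(
    rollout_rewards: Dict[str, List[float]],
    max_rate: float = 0.25,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Keep problems that are solved rarely but not never.

    Assumption: problems with zero observed successes are skipped because
    they yield no correct sample to reinforce.
    """
    selected = []
    for problem_id, rewards in rollout_rewards.items():
        rate = estimate_success_rate(rewards)
        if 0.0 < rate <= max_rate:
            selected.append(problem_id)
    return selected


# Placeholder rollouts from a base policy (illustrative values only).
rollout_rewards = {
    "easy": [1, 1, 1, 1, 1, 1, 1, 0],
    "hard": [0, 0, 0, 1, 0, 0, 0, 0],
    "unsolved": [0] * 8,
}
selected = select_low_likelihood(rollout_rewards)
```

The success-rate estimate comes from a small number of samples, so it is noisy, which matches the limitation noted for the low-likelihood heuristic.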

Modeling

Base Model: Qwen2.5-Math-1.5B/7B, Llama-3.2-3B-Instruct

Training Method: PPO / GRPO with SELF data curation

Objective Functions:

Purpose: Maximize expected reward while staying close to base model.

Formally: E[r(x,y) - beta * KL(pi_theta || pi_ref)]
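
As a rough illustration, the objective can be estimated from a batch of responses sampled from the current policy. The sketch assumes sequence-level log-probabilities are available under both policies and uses the standard sample estimate of KL(pi_theta || pi_ref), the mean of log pi_theta(y) - log pi_ref(y) over samples from pi_theta. The value of `beta` below is a placeholder, since the paper's coefficient is not reported.

```python
from typing import Sequence


def kl_regularized_objective(
    rewards: Sequence[float],
    logp_theta: Sequence[float],  # log pi_theta(y | x) for each sampled y
    logp_ref: Sequence[float],    # log pi_ref(y | x) for the same y
    beta: float,
) -> float:
    """Monte Carlo estimate of E[r(x, y)] - beta * KL(pi_theta || pi_ref).

    Assumes every y was sampled from pi_theta (on-policy).
    """
    n = len(rewards)
    if n == 0 or len(logp_theta) != n or len(logp_ref) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_reward = sum(rewards) / n
    kl_estimate = sum(lt - lr for lt, lr in zip(logp_theta, logp_ref)) / n
    return mean_reward - beta * kl_estimate


# Placeholder numbers for illustration only.
objective = kl_regularized_objective(
    rewards=[1.0, 0.0, 1.0, 1.0],
    logp_theta=[-2.0, -3.0, -1.0, -2.0],
    logp_ref=[-3.0, -3.0, -2.0, -2.0],
    beta=0.1,  # hypothetical value
)
```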

Adaptation: Full fine-tuning (inferred from the use of RLVR at these model sizes; not stated explicitly)

Training Data:

DeepScaleR dataset (approx 40k math problems)
Probing dataset D_prob generated from training prompts for analysis

Key Hyperparameters:

learning_rate: Not reported in the paper
clip_epsilon: 0.2
kl_coefficient_beta: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard PPO/GRPO: SELF modifies the data distribution to focus on low-likelihood problems, whereas standard RL samples proportionally to current policy likelihood (reinforcing the 'winner')
vs. RFT: RFT treats all correct solutions equally (or filters by correctness), whereas SELF specifically targets the 'hard' correct solutions that are at risk of being forgotten
vs. TRPO [not cited in paper]: TRPO enforces trust regions in policy space; SELF enforces diversity via data selection

Limitations

Analysis relies on an approximation of the influence function due to computational cost
Experiments focus exclusively on mathematical reasoning; generalization to coding or general domains is not tested
The 'low likelihood' selection heuristic requires estimating probabilities which can be noisy

Reproducibility

Code: https://github.com/mail-research/SELF-llm-interference

Training uses the DeepScaleR dataset. The release of model weights for specific checkpoints is not explicitly mentioned.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning using Chain-of-Thought

Benchmarks:

AIME24 & AIME25 (Challenging math competitions)
Math500 (General math problems)
Minerva (Science and math questions)

Metrics:

Pass@1 (Accuracy)
Pass@k (Coverage, specifically k=256)
Perplexity (PPL)
Statistical methodology: Correlation analysis between influence metrics and Pass@k drop
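
The correlation analysis can be illustrated with a plain Pearson coefficient. This is a generic sketch: the two input lists are placeholder values invented for the example, not measurements from the paper, and the variable names are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only (NOT results from the paper).
interference_metric = [0.1, 0.2, 0.3, 0.4]
pass_at_k_drop = [1.0, 2.0, 3.0, 4.0]
r = pearson(interference_metric, pass_at_k_drop)
```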

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Average across 4 benchmarks	Pass@256	Not reported in the paper	Not reported in the paper	Negative
Average across 4 benchmarks	Correlation (Pearson)	0	Positive (Strong)	Positive
Minerva	Frequency of Code Reasoning	High (Visual approx >50%)	Low (Visual approx <10%)	Large negative

Experiment Figures

Evolution of Pass@1 vs Pass@256 and Interference metrics over training steps

Perplexity trends distinguishing between 'easy' (high likelihood) and 'hard' (low likelihood) problems

Main Takeaways

RLVR improves Pass@1 (average accuracy) but degrades Pass@k (coverage) for large k, meaning the model solves fewer unique problems overall
This degradation is driven by 'negative interference,' where updates for easy problems actively harm the model's ability to solve harder ones
On-policy sampling creates a 'winner-take-all' dynamic: the model reinforces the reasoning style (e.g., natural language) it is already good at, causing a collapse in diversity (e.g., losing code reasoning capabilities)
Standard regularizers like clipping and Reverse KL fail to prevent this collapse because they do not address the data distribution imbalance caused by on-policy sampling
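
A toy model can make the first two takeaways concrete. The sketch below is an illustration under strong simplifying assumptions, not the paper's analysis: two problems share a single parameter `s`, the easy problem is solved by strategy 0 and the hard problem by strategy 1, and the update is the exact gradient of the summed expected reward. The hard problem's gradient is small because its correct strategy is rarely sampled, so the easy problem dominates the shared update.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def pass_at_k(q: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - q) ** k


def success_probs(s: float, bias_hard: float = 3.0):
    """Per-sample success probability on each toy problem."""
    q_easy = sigmoid(s)                    # easy: correct with strategy 0
    q_hard = 1.0 - sigmoid(s + bias_hard)  # hard: correct with strategy 1
    return q_easy, q_hard


def train(steps: int = 50, lr: float = 1.0, bias_hard: float = 3.0):
    s = 0.0
    history = [success_probs(s, bias_hard)]
    for _ in range(steps):
        p_easy = sigmoid(s)
        p_hard0 = sigmoid(s + bias_hard)
        # d/ds [q_easy + q_hard]: the easy term pushes s up, the hard term
        # pushes it down but is scaled by p(1 - p), which is tiny when the
        # correct strategy is unlikely under the current policy.
        grad = p_easy * (1.0 - p_easy) - p_hard0 * (1.0 - p_hard0)
        s += lr * grad
        history.append(success_probs(s, bias_hard))
    return history


history = train()
(q_easy_0, q_hard_0), (q_easy_T, q_hard_T) = history[0], history[-1]
avg_pass1_before = (q_easy_0 + q_hard_0) / 2
avg_pass1_after = (q_easy_T + q_hard_T) / 2
hard_pass256_before = pass_at_k(q_hard_0, 256)
hard_pass256_after = pass_at_k(q_hard_T, 256)
```

In this toy setting average Pass@1 rises while Pass@256 on the hard problem falls, mirroring the qualitative pattern described above. Real models have far more parameters, so the degree of sharing assumed here is deliberately extreme.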

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Language Model Reasoning (Chain-of-Thought)
Policy Gradients

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—optimizing LLMs using binary correctness signals (e.g., correct answer in math/code)

Pass@k: The probability that at least one solution is correct when k solutions are sampled from the model
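
In practice Pass@k is usually estimated from n >= k samples per problem rather than by drawing exactly k. The sketch below uses the commonly used unbiased estimator, 1 - C(n - c, k) / C(n, k), where c is the number of correct samples. Whether the paper uses this exact estimator is not stated in this summary, so treat the choice as an assumption.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# With 1 correct sample out of 4, coverage grows with k.
estimate_k1 = pass_at_k(n=4, c=1, k=1)
estimate_k2 = pass_at_k(n=4, c=1, k=2)
```

A benchmark-level score is then the mean of this estimate over all problems.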

Negative Interference: The phenomenon where learning to solve a specific set of training problems reduces the likelihood of generating correct solutions for other problems

Winner-take-all: A dynamic where the model reinforces only the most probable solution strategies (the 'winners') and suppresses diverse but initially less probable valid strategies

On-policy sampling: Generating training data using the current version of the model policy, which biases learning toward what the model already knows well

Plasticity loss: The loss of a neural network's ability to learn new things or adapt to new distributions over time

PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to be close to the previous policy
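
The constraint is typically implemented with the clipped surrogate objective. The sketch below computes the per-sample term with `clip_epsilon = 0.2`, the value listed under Key Hyperparameters; the function name and scalar interface are assumptions chosen for readability.

```python
import math


def ppo_clipped_term(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_epsilon: float = 0.2,
) -> float:
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large increase in probability on a positive-advantage sample is capped.
capped = ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0)
# A decrease in probability on a positive-advantage sample is not clipped.
uncapped = ppo_clipped_term(math.log(0.5), 0.0, advantage=1.0)
```

Clipping limits how far each sample can move the policy in one update, but it does not change which samples appear in the batch.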

GRPO: Group Relative Policy Optimization—a PPO variant often used for reasoning that normalizes advantages within a group of sampled outputs for the same prompt
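
The group normalization can be sketched in a few lines. This is a minimal sketch assuming the common form (reward minus group mean, divided by group standard deviation); the use of the population standard deviation and the `eps` stabilizer are assumptions that vary between implementations.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# One correct response out of four gets a large positive advantage.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
# If every response fails, all advantages are zero.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under this formula, a prompt whose sampled responses are all incorrect contributes zero advantage, so rarely solved problems provide little learning signal unless a correct sample happens to be drawn.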

Perplexity: A measure of how well a probability model predicts a sample, computed as the exponential of the average negative log-likelihood per token; lower perplexity means the model assigns higher probability to the sample
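
Given per-token log-probabilities of a solution under the model, perplexity is the exponential of the mean negative log-probability. The sketch assumes natural-log probabilities; the function name is hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the average negative log-probability per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A sequence whose tokens each have probability 0.5 has perplexity 2.
ppl_uniform_half = perplexity([math.log(0.5)] * 4)
```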

KL regularization: A penalty term added to the loss function to prevent the learned policy from diverging too far from a reference policy (usually the base model)