
On the Learning Dynamics of RLVR at the Edge of Competence

Yu Huang, Zixin Wen, Yuejie Chi, Yuting Wei, Aarti Singh, Yingbin Liang, Yuxin Chen
Wharton School, University of Pennsylvania, Carnegie Mellon University, Yale University, The Ohio State University
arXiv (2026)
RL Reasoning

📝 Paper Summary

Tags: Reinforcement Learning with Verifiable Rewards (RLVR) · Compositional Reasoning · Theoretical Analysis of Deep Learning
RLVR solves long-horizon reasoning tasks through a "relay effect": smooth difficulty curricula bridge the gradient barrier, while large difficulty gaps cause stalling and grokking-like phase transitions.
Core Problem
RLVR relies on sparse, outcome-based rewards (correct/incorrect), which provide little signal for long-horizon reasoning tasks where the search space of trajectories is exponentially large.
Why it matters:
  • It remains a mystery how outcome-only feedback can drive learning in complex reasoning chains (like math or coding) without dense intermediate supervision
  • Understanding these dynamics is crucial for scaling reasoning models (like OpenAI-o3 or DeepSeek-R1) efficiently rather than relying on trial-and-error data mixing
  • Current empirical observations of 'grokking' (sudden learning after long plateaus) lack a rigorous theoretical mechanism explaining when and why they occur
Concrete Example: In a multi-step state tracking task (e.g., applying 45 sequential operations), a model initialized with random attention has a near-zero chance of guessing the final state correctly. Without intermediate feedback, the gradient is exponentially flat, and the model learns nothing for a long time.
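The flat-gradient intuition above can be illustrated with a toy Monte Carlo sketch (my own illustration, not the paper's actual setup): if a model tracks each of the L sequential operations correctly only with some per-step probability, the chance of an outcome reward on the full trajectory decays exponentially in L, so the learning signal vanishes for long horizons.

```python
import random

def success_prob(horizon, per_step_acc, trials=100_000, seed=0):
    """Estimate the probability that a trajectory earns the outcome reward.

    Toy model: the trajectory succeeds only if every one of `horizon`
    sequential state-tracking steps is correct, each with probability
    `per_step_acc` (a stand-in for a weakly trained model).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if all(rng.random() < per_step_acc for _ in range(horizon)):
            wins += 1
    return wins / trials

# Reward signal collapses as the horizon grows (roughly per_step_acc ** L),
# mirroring why Length=45 gives near-zero reward without a curriculum.
for length in (5, 20, 45):
    print(f"L={length:2d}  P(reward) ~ {success_prob(length, 0.9):.4f}")
```

With 90% per-step accuracy, the estimate tracks 0.9**L: healthy at L=5 but under 1% at L=45, matching the "near-zero chance" described above.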
Key Novelty
Theoretical framework for 'Relay Dynamics' vs 'Grokking' in RLVR
  • Identifies that the smoothness of the problem difficulty spectrum determines learning phases: smooth spectra allow easier problems to 'relay' gradient signals to slightly harder ones
  • Demonstrates that 'grokking' (long plateaus followed by jumps) arises specifically from discontinuities in difficulty, where the model must over-master an easy task before the next hard task provides any signal
  • Introduces a Fourier analysis framework on finite groups to mathematically estimate policy gradients for long-horizon compositional tasks, overcoming the intractability of trajectory-level probability calculations
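For readers unfamiliar with Fourier analysis on finite groups, here is a minimal sketch of the general idea on the simplest case, the cyclic group Z_n (the choice of Z_n, and all names below, are my illustration; the paper's framework and groups may differ): expanding a function over the group's characters turns convolution-style quantities (such as trajectory probabilities built from composed operations) into products of coefficients, which is what makes long-horizon gradient estimates tractable.

```python
import cmath

def character(n, j, x):
    # j-th character of the cyclic group Z_n: chi_j(x) = exp(2*pi*i*j*x / n)
    return cmath.exp(2j * cmath.pi * j * x / n)

def fourier_coeffs(f_vals):
    """Fourier coefficients of f : Z_n -> C, given as a list of n values.

    hat{f}(j) = (1/n) * sum_x f(x) * conj(chi_j(x))
    """
    n = len(f_vals)
    return [
        sum(f_vals[x] * character(n, j, x).conjugate() for x in range(n)) / n
        for j in range(n)
    ]

# Sanity check: a pure character has a spectrum concentrated on one frequency.
n = 8
f = [character(n, 1, x) for x in range(n)]
coeffs = fourier_coeffs(f)
print([round(abs(c), 6) for c in coeffs])
```

On non-abelian groups the characters are replaced by irreducible matrix representations, but the same diagonalization principle applies.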
Evaluation Highlights
  • Synthetic experiments show mixed-difficulty training with a moderate ratio (R=3) enables solving long-horizon tasks (Length=45), whereas fixed-length training at Length=45 fails completely (near-zero reward)
  • Large difficulty ratios (R=7) cause 'grokking': the model stalls at near-zero reward on longer tasks for extended periods before sudden mastery, confirming theoretical predictions of phase transitions
  • Short-horizon training (Length=5) succeeds rapidly with optimal rewards, while horizons beyond a critical threshold (approx. Length > 20) exhibit prolonged reward plateaus in the absence of a curriculum
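The mixed-difficulty setup in these experiments can be sketched as a simple curriculum sampler (a hypothetical reconstruction: the uniform mixing, the base length of 5, and the function names are my assumptions; the summary only specifies the difficulty ratio R and the target Length=45):

```python
import random

def curriculum_lengths(base_len=5, ratio=3, levels=3):
    # Geometric ladder of task lengths: base_len * ratio**k for each level.
    # With base_len=5, ratio=3, levels=3 this yields [5, 15, 45], so the
    # hardest level matches the Length=45 target in the experiments.
    return [base_len * ratio**k for k in range(levels)]

def sample_task_length(rng=random, **kwargs):
    # Mixed-difficulty training: draw each episode's length uniformly
    # from the ladder, so easier tasks can "relay" signal to harder ones.
    return rng.choice(curriculum_lengths(**kwargs))

print(curriculum_lengths())            # [5, 15, 45]
print(sample_task_length(ratio=7))     # large R=7 gap -> grokking regime
```

Under this sketch, R=3 keeps adjacent levels close enough for the relay effect, while R=7 (lengths 5, 35, 245) leaves a gap the gradient cannot bridge until the easier level is over-mastered.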
Breakthrough Assessment
8/10
Provides the first rigorous theoretical explanation for why RLVR works 'at the edge of competence' and mechanistically explains the grokking phenomenon in reasoning tasks. Highly relevant to current LLM reasoning developments.