
The Path Not Taken: RLVR Provably Learns Off the Principals

Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai
Meta AI, The University of Texas at Austin
arXiv (2025)
Tags: RL · Reasoning · Benchmark

📝 Paper Summary

Tags: Reinforcement Learning with Verifiable Rewards (RLVR) · Large Reasoning Models (LRMs) · Post-training dynamics
RLVR improves reasoning not by changing principal weights, but by updating low-magnitude, off-principal parameters in a pattern dictated by the pretrained model's geometry and amplified by bfloat16 precision.
Core Problem
While RLVR drives large reasoning gains, it paradoxically modifies very few parameters (high update sparsity), and it has been unclear where these sparse updates land and why.
Why it matters:
  • Current SFT-based intuition suggests targeting 'principal' high-magnitude weights, which fails for RLVR, leading to ineffective training algorithms
  • Understanding RL dynamics is crucial for designing efficient post-training methods rather than blindly applying SFT-era heuristics like LoRA/PiSSA
  • The paradox of 'high gain from minimal change' challenges standard views on how deep learning models acquire new capabilities
Concrete Example: Applying PiSSA (a PEFT method that initializes adapters along principal weight directions) to RLVR leads to collapse or stalled improvement, because it forces updates into the high-curvature directions that RL inherently avoids; standard LoRA, which leaves updates free to land off-principal, does not suffer this failure.
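A toy sketch of the initialization difference behind this example (NumPy, hypothetical dimensions and rank; not the paper's code): PiSSA builds its adapter from the top-r singular directions of a weight matrix, so every adapter update lives in the principal subspace, whereas a vanilla LoRA adapter starts at zero and is free to drift off-principal.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))            # a stand-in weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)

r = 4
# PiSSA-style init: the adapter spans the top-r principal directions of W.
A_pissa = U[:, :r] * np.sqrt(S[:r])              # (64, r)
B_pissa = np.sqrt(S[:r])[:, None] * Vt[:r]       # (r, 64)

# LoRA-style init: B starts at zero, so the initial update is zero and the
# eventual update direction is unconstrained rather than pinned on-principal.
A_lora = rng.normal(scale=0.01, size=(64, r))
B_lora = np.zeros((r, 64))

# The PiSSA adapter reconstructs exactly the rank-r principal part of W...
principal = U[:, :r] @ np.diag(S[:r]) @ Vt[:r]
assert np.allclose(A_pissa @ B_pissa, principal)
# ...while the LoRA adapter contributes nothing at initialization.
assert np.allclose(A_lora @ B_lora, 0.0)
```

Because the PiSSA adapter's range is pinned to the principal subspace, any gradient step through it moves the high-curvature directions the Three-Gate theory says RL avoids.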
Key Novelty
Three-Gate Theory of RLVR Dynamics
  • **Gate I (KL Anchor):** On-policy RL imposes a strict trust-region constraint, limiting how far parameters can move from the base policy in a single step
  • **Gate II (Model Geometry):** This constraint steers updates away from high-curvature 'principal' directions (which would break the constraint) and into low-curvature, spectrum-preserving subspaces
  • **Gate III (Precision):** bfloat16 storage filters out micro-updates in non-preferred regions, making the continuous off-principal bias appear as discrete sparsity
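Gate III can be illustrated with a small numeric sketch (this is an assumption-laden toy, not the paper's experiment): bfloat16 keeps only 8 mantissa bits, so near 1.0 its spacing (ULP) is 2^-7 ≈ 0.0078, and any update smaller than about half that is rounded away when the weight is stored back in bf16.

```python
import numpy as np

def to_bf16(x: float) -> np.float32:
    """Round a float32 value to bfloat16 (round-to-nearest-even on the
    top 16 bits), returned as float32 for convenience."""
    u = int(np.float32(x).view(np.uint32))
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFF0000
    return np.uint32(u).view(np.float32)

w = 1.0        # stored weight
tiny = 1e-3    # micro-update, well below the bf16 ULP near 1.0 (~0.0078)
big = 1e-2     # update above the ULP

# In bf16 storage the micro-update is silently dropped...
assert to_bf16(w + tiny) == to_bf16(w)
# ...while the larger update survives rounding.
assert to_bf16(w + big) != to_bf16(w)
```

This rounding filter is how a continuous off-principal bias in the updates can surface as discrete sparsity: coordinates receiving only micro-updates never change in the stored checkpoint.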
Evaluation Highlights
  • RLVR updates overlap with principal weights at sub-random rates, whereas SFT preferentially targets them
  • Freezing 50% of weights (principal/high-magnitude) and updating only the rest recovers full RLVR performance and KL trajectory on DeepSeek-R1-Distill-Qwen-1.5B
  • Disrupting model geometry via orthogonal rotation of layers destroys the update bias, confirming it is model-conditioned
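The first highlight's overlap diagnostic can be sketched as follows (a synthetic simulation with made-up probabilities, not the paper's measurements): compare the fraction of updated coordinates that fall in the top-magnitude "principal" set against the chance rate k/n expected for uniformly random updates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
w = rng.normal(size=n)                   # stand-in pretrained weights

k = n // 10                              # top-10% magnitude = "principal" set
is_principal = np.zeros(n, dtype=bool)
is_principal[np.argsort(-np.abs(w))[:k]] = True

# Simulate an update mask biased AWAY from principal weights, as reported
# for RLVR; a uniform mask would give overlap ~= k/n = 0.10.
p_update = np.where(is_principal, 0.02, 0.12)
updated = rng.random(n) < p_update

overlap = (updated & is_principal).sum() / updated.sum()
chance = k / n
print(f"overlap={overlap:.3f} vs chance={chance:.2f}")
```

Sub-random overlap (overlap well below k/n) is the signature distinguishing RLVR's off-principal updates from SFT's principal-seeking ones.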
Breakthrough Assessment
9/10
Provides the first mechanistic, parameter-level explanation for RLVR's unique optimization regime. Fundamentally shifts PEFT design from SFT-mimicry to geometry-aware methods.