
The Path Not Taken: RLVR Provably Learns Off the Principals

Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai
Meta AI, The University of Texas at Austin
arXiv (2025)
Tags: RL · Reasoning · Benchmark

📝 Paper Summary

Tags: Reinforcement Learning with Verifiable Rewards (RLVR) · Large Reasoning Models (LRMs) · Post-training dynamics
RLVR improves reasoning not by changing principal weights, but by updating low-magnitude, off-principal parameters in a pattern dictated by the pretrained model's geometry and amplified by bfloat16 precision.
Core Problem
While RLVR drives large reasoning gains, it paradoxically modifies very few parameters (high update sparsity), and it has been unclear where these sparse updates land and why.
Why it matters:
  • Current SFT-based intuition suggests targeting 'principal' high-magnitude weights, which fails for RLVR, leading to ineffective training algorithms
  • Understanding RL dynamics is crucial for designing efficient post-training methods rather than blindly applying SFT-era heuristics like LoRA/PiSSA
  • The paradox of 'high gain from minimal change' challenges standard views on how deep learning models acquire new capabilities
Concrete Example: Applying PiSSA (a PEFT method that initializes adapters along principal weight directions) to RLVR leads to collapse or stalled improvement, because it forces updates into the high-curvature directions that RL inherently avoids; standard LoRA, which leaves updates free to land off-principal, does not suffer this failure.
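A toy sketch of the initialization difference behind this example (NumPy, hypothetical dimensions and rank; not the paper's code): PiSSA builds its adapter from the top-r singular directions of a weight matrix, so every adapter update lives in the principal subspace, whereas a vanilla LoRA adapter starts at zero and is free to drift off-principal.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))            # a stand-in weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)

r = 4
# PiSSA-style init: the adapter spans the top-r principal directions of W.
A_pissa = U[:, :r] * np.sqrt(S[:r])              # (64, r)
B_pissa = np.sqrt(S[:r])[:, None] * Vt[:r]       # (r, 64)

# LoRA-style init: B starts at zero, so the initial update is zero and the
# eventual update direction is unconstrained rather than pinned on-principal.
A_lora = rng.normal(scale=0.01, size=(64, r))
B_lora = np.zeros((r, 64))

# The PiSSA adapter reconstructs exactly the rank-r principal part of W...
principal = U[:, :r] @ np.diag(S[:r]) @ Vt[:r]
assert np.allclose(A_pissa @ B_pissa, principal)
# ...while the LoRA adapter contributes nothing at initialization.
assert np.allclose(A_lora @ B_lora, 0.0)
```

Because the PiSSA adapter's range is pinned to the principal subspace, any gradient step through it moves the high-curvature directions the Three-Gate theory says RL avoids.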
Key Novelty
Three-Gate Theory of RLVR Dynamics
  • **Gate I (KL Anchor):** On-policy RL imposes a strict trust-region constraint, limiting how far parameters can move from the base policy in a single step
  • **Gate II (Model Geometry):** This constraint steers updates away from high-curvature 'principal' directions (which would break the constraint) and into low-curvature, spectrum-preserving subspaces
  • **Gate III (Precision):** bfloat16 storage filters out micro-updates in non-preferred regions, making the continuous off-principal bias appear as discrete sparsity
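Gate III can be illustrated with a small numeric sketch (this is an assumption-laden toy, not the paper's experiment): bfloat16 keeps only 8 mantissa bits, so near 1.0 its spacing (ULP) is 2^-7 ≈ 0.0078, and any update smaller than about half that is rounded away when the weight is stored back in bf16.

```python
import numpy as np

def to_bf16(x: float) -> np.float32:
    """Round a float32 value to bfloat16 (round-to-nearest-even on the
    top 16 bits), returned as float32 for convenience."""
    u = int(np.float32(x).view(np.uint32))
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFF0000
    return np.uint32(u).view(np.float32)

w = 1.0        # stored weight
tiny = 1e-3    # micro-update, well below the bf16 ULP near 1.0 (~0.0078)
big = 1e-2     # update above the ULP

# In bf16 storage the micro-update is silently dropped...
assert to_bf16(w + tiny) == to_bf16(w)
# ...while the larger update survives rounding.
assert to_bf16(w + big) != to_bf16(w)
```

This rounding filter is how a continuous off-principal bias in the updates can surface as discrete sparsity: coordinates receiving only micro-updates never change in the stored checkpoint.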
Evaluation Highlights
  • RLVR updates overlap with principal weights at sub-random rates, whereas SFT preferentially targets them
  • Freezing 50% of weights (principal/high-magnitude) and updating only the rest recovers full RLVR performance and KL trajectory on DeepSeek-R1-Distill-Qwen-1.5B
  • Disrupting model geometry via orthogonal rotation of layers destroys the update bias, confirming it is model-conditioned
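The first highlight's overlap diagnostic can be sketched as follows (a synthetic simulation with made-up probabilities, not the paper's measurements): compare the fraction of updated coordinates that fall in the top-magnitude "principal" set against the chance rate k/n expected for uniformly random updates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
w = rng.normal(size=n)                   # stand-in pretrained weights

k = n // 10                              # top-10% magnitude = "principal" set
is_principal = np.zeros(n, dtype=bool)
is_principal[np.argsort(-np.abs(w))[:k]] = True

# Simulate an update mask biased AWAY from principal weights, as reported
# for RLVR; a uniform mask would give overlap ~= k/n = 0.10.
p_update = np.where(is_principal, 0.02, 0.12)
updated = rng.random(n) < p_update

overlap = (updated & is_principal).sum() / updated.sum()
chance = k / n
print(f"overlap={overlap:.3f} vs chance={chance:.2f}")
```

Sub-random overlap (overlap well below k/n) is the signature distinguishing RLVR's off-principal updates from SFT's principal-seeking ones.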
Breakthrough Assessment
9/10
Provides the first mechanistic, parameter-level explanation for RLVR's unique optimization regime. Fundamentally shifts PEFT design from SFT-mimicry to geometry-aware methods.