RLVR: Reinforcement Learning with Verifiable Rewards—using objective outcomes (e.g., code execution, math answers) to train reasoning models
principal weights: Weights corresponding to the largest singular values/vectors of a layer's weight matrix, representing high-energy/high-importance directions
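A minimal sketch of what "principal weights" means, using NumPy's SVD on a toy weight matrix (the matrix shape, `k`, and variable names are illustrative, not from the paper):

```python
import numpy as np

# Toy "layer weight matrix" standing in for a real model layer
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))

# Thin SVD: W = U @ diag(S) @ Vt, singular values sorted descending
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Keep the top-k singular directions: the "principal weights"
k = 2
W_principal = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# The remainder lies in the off-principal subspace
W_off = W - W_principal

# The principal part carries the largest share of Frobenius "energy"
energy_frac = np.linalg.norm(W_principal, "fro")**2 / np.linalg.norm(W, "fro")**2
print(f"fraction of energy in top-{k} directions: {energy_frac:.3f}")
```

The energy fraction equals the sum of the top-k squared singular values over the total, which is why these directions are called high-energy.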
SFT: Supervised Fine-Tuning—standard training on labeled data, which this paper shows operates in a different geometric regime than RLVR
bfloat16: A 16-bit floating point format with only 7 mantissa bits of precision; gradient updates smaller than half a unit in the last place round away entirely, so many weights appear untouched (apparent update sparsity)
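The rounding effect can be demonstrated without any ML framework by emulating bfloat16 as a float32 with the low 16 bits rounded off (round-to-nearest-even); the helper name and the update size below are illustrative:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a float to bfloat16 precision (round-to-nearest-even on the
    16-bit truncation of a float32), returned as a regular Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even: add 0x7FFF plus the LSB of the kept half
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

w = to_bfloat16(1.0)
tiny_update = 1e-3   # a small gradient step, well under half an ulp near 1.0

# The update is silently dropped: 1.0 + 1e-3 rounds back to exactly 1.0
print(to_bfloat16(w + tiny_update) == w)
```

Near 1.0 the bfloat16 spacing is 2^-7 ≈ 0.0078, so any update below ~0.0039 vanishes, which is the mechanism behind the apparent sparsity.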
KL leash: The constraint imposed by RL algorithms that penalizes the policy for diverging too far from the reference model (Kullback-Leibler divergence)
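A toy illustration of the leash, assuming the common reward-shaping form reward − β·KL(policy ‖ reference) over a single next-token distribution; the distributions and the coefficient β are made up for the example:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions from the policy and the reference model
policy = [0.70, 0.20, 0.10]
reference = [0.50, 0.30, 0.20]

beta = 0.05    # illustrative KL coefficient
reward = 1.0   # verifiable reward (e.g., the answer checked out)

kl = kl_divergence(policy, reference)
penalized = reward - beta * kl
print(f"KL = {kl:.4f}, penalized reward = {penalized:.4f}")
```

The further the policy drifts from the reference, the larger the KL term and the more reward is clawed back, which is the "leash".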
off-principal: Directions in weight space orthogonal to the principal components; low-curvature regions where RL updates tend to concentrate
spectral drift: The change in the distribution of singular values of weight matrices during training; RLVR induces far less spectral drift than SFT
PiSSA: Principal Singular values and Singular vectors Adaptation—a PEFT method that initializes adapters using principal components, targeting high-energy directions
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines from groups of outputs for the same prompt
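A sketch of GRPO's group-relative baseline, assuming the standard formulation where each reward is standardized against the mean and standard deviation of its own group (the helper name and the example rewards are illustrative):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each reward against the
    mean and (population) std of its own group, i.e. the set of outputs
    sampled for the same prompt."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:   # every output scored the same: no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Hypothetical verifiable rewards for 4 sampled completions of one prompt
group_rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(group_rewards))
```

Because the baseline comes from the group itself, no separate value network is needed: correct completions get positive advantage and incorrect ones negative, relative to their siblings.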