RLVR: Reinforcement Learning with Verifiable Rewards—optimizing models using objective, computable success signals (like math answers or code execution) rather than human preference labels
Empirical Support: The set of correct solutions that a model can realistically discover under a finite sampling budget (i.e., those it generates with probability greater than a small epsilon)
SRR: Support Retention Rate—the fraction of the base model's accessible correct solutions that remain accessible after RLVR training
NDR: Net Discovery Rate—the fraction of the RLVR model's accessible solutions that were effectively inaccessible to the base model (genuine discoveries)
NSCR: Net Support Change Rate—a metric capturing whether the accessible solution space has expanded (positive) or shrunk (negative) overall
Pass@k: A metric measuring the probability that at least one correct solution is generated within k independent samples
Invisible Leash: The hypothesis that RLVR is fundamentally constrained by the base model's initial distribution and cannot easily discover reasoning patterns outside that support
SDS: Support Dynamic Score—a harmonic mean balancing retention of old solutions and discovery of new ones
GRPO: Group Relative Policy Optimization—an RL algorithm used for reasoning models (like DeepSeek-R1) that normalizes rewards within a group of samples for the same prompt
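The sampling and support metrics above can be sketched concretely. Below is a minimal illustration, assuming empirical supports are represented as sets of accessible solutions: `pass_at_k` uses the standard unbiased combinatorial estimator, and `support_metrics` computes SRR, NDR, NSCR, and SDS from base-model and RLVR-model support sets. The exact normalizations for NSCR (net change relative to the base support) and SDS (harmonic mean of SRR and NDR) are plausible readings of the definitions, not the paper's verified formulas.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples is correct, given c correct solutions among n total samples."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

def support_metrics(base_support: set, rlvr_support: set):
    """Compute support-dynamics metrics from two empirical supports
    (sets of solutions accessible with probability > epsilon).
    NSCR and SDS formulations are illustrative assumptions."""
    retained = base_support & rlvr_support      # still accessible after RLVR
    discovered = rlvr_support - base_support    # genuine discoveries
    lost = base_support - rlvr_support          # solutions RLVR forgot
    srr = len(retained) / len(base_support)
    ndr = len(discovered) / len(rlvr_support)
    # Net change of the accessible space, relative to the base support
    nscr = (len(discovered) - len(lost)) / len(base_support)
    # Harmonic mean balancing retention and discovery
    sds = 2 * srr * ndr / (srr + ndr) if srr + ndr > 0 else 0.0
    return srr, ndr, nscr, sds
```

For example, if the base model's support is `{1, 2, 3, 4}` and the RLVR model's is `{2, 3, 4, 5}`, then SRR = 0.75 (three of four base solutions retained), NDR = 0.25 (one of four RLVR solutions is new), NSCR = 0 (one gained, one lost), and SDS = 0.375.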