RLVR: Reinforcement Learning with Verifiable Rewards, i.e., training LLMs by generating solutions, verifying them (e.g., via unit tests), and updating the model using the verification result as the reward
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of samples from the same prompt, removing the need for a separate critic model
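Youden's Index: A statistic (J = TPR - FPR) measuring the performance of a binary diagnostic test; J = 1 is a perfect test, J = 0 is no better than random chance, and J < 0 means the test is worse than chance (its verdicts are anti-correlated with the truth)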
TPR: True Positive Rate—the probability that a correct solution is rewarded as correct
FPR: False Positive Rate—the probability that an incorrect solution is erroneously rewarded as correct
Replicator Dynamics: A mathematical model from evolutionary game theory describing how the proportion of different types (strategies) in a population changes over time based on their relative fitness
Phase Transition: A sharp change in the behavior of a system (here, from learning to anti-learning) as a parameter (Youden's Index) crosses a critical threshold
Logit: The raw, unnormalized output score from the model before being converted into a probability
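
As an illustration of two of the definitions above, the following minimal Python sketch computes Youden's Index from a verifier's verdicts and normalizes rewards within a group in the way the GRPO entry describes. The function names (`youden_index`, `group_relative_advantages`), the `eps` stabilizer, and the choice of population standard deviation are assumptions made for this example; they are not taken from the source or from any particular library.

```python
from statistics import mean, pstdev


def youden_index(is_correct, rewarded):
    """Return J = TPR - FPR for a binary verifier.

    is_correct[i] is the ground truth for solution i;
    rewarded[i] is whether the verifier rewarded it as correct.
    """
    pos = [r for c, r in zip(is_correct, rewarded) if c]      # verdicts on correct solutions
    neg = [r for c, r in zip(is_correct, rewarded) if not c]  # verdicts on incorrect solutions
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect solution")
    tpr = sum(pos) / len(pos)  # share of correct solutions that were rewarded
    fpr = sum(neg) / len(neg)  # share of incorrect solutions that were rewarded
    return tpr - fpr


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples from the same prompt.

    Assumption: subtract the group mean and divide by the population
    standard deviation; eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifier: rewards 3 of 4 correct and 1 of 4 incorrect solutions.
truth = [True, True, True, True, False, False, False, False]
verdict = [True, True, True, False, True, False, False, False]
j = youden_index(truth, verdict)  # TPR = 0.75, FPR = 0.25

# Hypothetical group of four samples for one prompt with binary rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In this sketch the group mean serves as the baseline, which is why no separate critic model is needed; a group whose samples all receive the same reward yields advantages of zero and therefore no learning signal.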