URLVR: Unsupervised Reinforcement Learning with Verifiable Rewards; learning from proxy rewards derived without human labels
Intrinsic Rewards: Reward signals derived solely from the model's own internal state (e.g., confidence, consistency) rather than external verification
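As a concrete illustration, two label-free intrinsic rewards of this kind can be sketched as follows. The function names and the specific averaging/majority-vote choices are hypothetical, not a particular method from the text:

```python
from collections import Counter

def confidence_reward(token_logprobs):
    """Confidence-based intrinsic reward: mean log-probability the model
    assigns to its own sampled tokens (higher = more confident).
    Requires no external labels or verifiers."""
    return sum(token_logprobs) / len(token_logprobs)

def consistency_reward(sampled_answers):
    """Consistency-based intrinsic reward: fraction of independently
    sampled answers that agree with the majority answer."""
    majority_answer, count = Counter(sampled_answers).most_common(1)[0]
    return count / len(sampled_answers)
```

Note that both signals can be driven up without the answers being correct, which is the reward-hacking failure mode defined later in this glossary.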
External Rewards: Reward signals derived from external verification processes (e.g., code execution, proof checkers) or from the structure of unlabeled data
Sharpening Mechanism: The theoretical convergence behavior in which a policy becomes increasingly deterministic around its initial preferences, reducing output entropy regardless of whether those preferred outputs are correct
Model Collapse Step: A proposed metric counting how many training steps it takes for a model's outputs to degrade, used as a proxy for the quality of the model's prior knowledge
REINFORCE: A policy gradient algorithm that updates model parameters along the gradient of the log-probability of sampled actions, weighted by the reward received, so as to maximize expected reward
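A minimal sketch of REINFORCE on a two-armed bandit; the setup and hyperparameters are illustrative, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # policy logits over two actions
lr = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative bandit: arm 1 pays reward 1.0, arm 0 pays 0.0.
for _ in range(200):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = 1.0 if a == 1 else 0.0
    # REINFORCE update: r * grad log pi(a).
    # For a softmax policy, grad log pi(a) = onehot(a) - probs.
    grad_logp = -probs
    grad_logp[a] += 1.0
    theta += lr * r * grad_logp
```

After training, the policy concentrates on the rewarded arm; note the update pushes probability toward whatever was rewarded, with no notion of correctness beyond the reward signal itself.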
Test-Time Training: Updating the model parameters temporarily on a specific test instance or small batch at inference time to improve performance
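A minimal sketch of this idea, assuming a linear model and a hypothetical reconstruction objective as the test-time loss (both are assumptions for illustration):

```python
import numpy as np

def ttt_predict(w, x, steps=5, lr=0.1):
    """Test-time training sketch: adapt a *copy* of the weights on a
    self-supervised loss for this single input, predict with the adapted
    copy, then discard it (the original weights are untouched)."""
    w_adapted = w.copy()
    for _ in range(steps):
        # Hypothetical self-supervised objective: reconstruction loss
        # L = ||w x - x||^2, with gradient dL/dw = 2 (w x - x) x^T.
        residual = w_adapted @ x - x
        w_adapted -= lr * 2.0 * np.outer(residual, x)
    return w_adapted @ x
```

Each test input gets its own few gradient steps; persisting the adapted weights across inputs would instead amount to ordinary online fine-tuning.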
Reward Hacking: When a model learns to exploit flaws in the reward function to get high scores without actually achieving the intended goal (e.g., maximizing confidence without correctness)