RLVR: Reinforcement Learning with Verifiable Rewards—RL where rewards are determined by programmatic checks (e.g., correct math answer, passing unit tests).
Outcome Verifier: A deterministic function that checks if a model's final answer matches the ground truth.
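A minimal sketch of an outcome verifier for math answers, assuming (hypothetically) that the model marks its final answer with `\boxed{...}`; the extraction convention and function name are illustrative, not a specific system's API:

```python
import re

def outcome_verifier(response: str, ground_truth: str) -> float:
    """Deterministic check: extract the final boxed answer from the
    model's response and compare it to the ground truth string.
    Returns a binary reward (1.0 correct, 0.0 otherwise)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```

Because the check is purely programmatic, the reward depends only on the final answer, not on the reasoning that produced it.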
PPO: Proximal Policy Optimization—a policy gradient RL algorithm that prevents drastic policy updates using a clipped objective function.
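The clipped objective can be sketched as follows; this is a simplified NumPy version operating on per-token log-probabilities and advantages, not a full PPO implementation:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss.
    ratio = pi_new(a|s) / pi_old(a|s), computed in log space for stability.
    Clipping the ratio to [1 - eps, 1 + eps] caps how far a single
    update can move the policy from the one that collected the data."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    advantages = np.asarray(advantages)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The objective is maximized, so the loss is its negation.
    return -np.minimum(unclipped, clipped).mean()
```

Taking the elementwise minimum of the clipped and unclipped terms makes the objective pessimistic: large policy shifts get no extra credit.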
Online RL: Training where the model updates its policy based on data it generates in real time during training, rather than on a static, pre-collected dataset.
GAE: Generalized Advantage Estimation—a method to estimate the advantage function (how good an action is) by balancing bias and variance.
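The bias-variance trade-off in GAE is controlled by the decay parameter lambda, as in this sketch (variable names are illustrative; `values` is assumed to include a bootstrap value for the state after the last step):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)       (TD error)
    A_t     = sum_l (gamma * lam)^l * delta_{t+l}     (discounted sum)
    lam=0 gives one-step TD (high bias, low variance);
    lam=1 gives Monte Carlo returns minus V (low bias, high variance)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Accumulate the discounted sum of TD errors backward in time.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```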
Superficial self-reflection: A failure mode where models appear to critique their work but lack the genuine ability to identify errors, often verifying based on surface features rather than logic.
Pass@k: A metric measuring the probability that at least one correct solution is generated within k attempts.
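Naively computing pass@k from k samples is high-variance; the commonly used unbiased estimator (popularized by the Codex paper) draws n >= k samples, counts c correct ones, and computes 1 - C(n-c, k)/C(n, k), i.e., one minus the probability that a size-k subset contains no correct solution:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from n samples with c correct.
    C(n-c, k)/C(n, k) is the chance that k draws (without replacement)
    are all incorrect; pass@k is its complement."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k all-wrong draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```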