RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness (like exact match) to guide RL training
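A minimal sketch of what a verifiable reward can look like, assuming exact-match grading of a final answer against a reference; the function name and normalization are illustrative, not from the source:

```python
# Illustrative verifiable reward for RLVR-style training: binary exact match
# against a reference answer (whitespace-normalized). Real graders may also
# parse boxed answers, run unit tests, or check numeric equivalence.
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer exactly matches the reference."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0
```

Because the reward is an objective check rather than a learned preference model, it cannot be gamed by stylistic tricks, which is the core appeal of RLVR.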
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input to reduce variance without a value network
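The group normalization at the heart of GRPO can be sketched in a few lines, assuming scalar rewards for G rollouts of the same prompt (the `eps` term is an illustrative numerical-stability choice):

```python
# GRPO-style group-relative advantages: normalize each rollout's reward by the
# mean and standard deviation of its group, so the group itself serves as the
# baseline and no separate value network is needed.
def group_relative_advantages(rewards, eps=1e-8):
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct and two incorrect rollouts for the same prompt:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantage and incorrect ones negative, purely from within-group comparison.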
Dead Ends: Situations in RL training where the model consistently fails to find a correct answer across multiple rollout attempts; under group-relative rewards, a group of uniformly failed rollouts yields zero advantage and thus no learning signal, stalling training
Importance Sampling: A statistical technique used to estimate properties of a target distribution while sampling from a different "behavior" distribution by weighting samples
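A toy numerical illustration, with made-up discrete distributions: we estimate an expectation under a target distribution p while drawing samples only from a behavior distribution q, reweighting each sample by p(x)/q(x):

```python
import random

# Target distribution p and behavior distribution q over {0, 1, 2}.
p = {0: 0.1, 1: 0.3, 2: 0.6}   # distribution we care about
q = {0: 0.4, 1: 0.4, 2: 0.2}   # distribution we actually sample from

def f(x):
    return float(x)            # quantity whose expectation under p we want

random.seed(0)
samples = random.choices(list(q), weights=list(q.values()), k=100_000)

# Weight each draw from q by the likelihood ratio p(x)/q(x).
estimate = sum(f(x) * p[x] / q[x] for x in samples) / len(samples)
true_value = sum(f(x) * px for x, px in p.items())  # E_p[f] = 0.3 + 1.2 = 1.5
```

The weighted average converges to E_p[f] even though no sample was ever drawn from p; the price is higher variance when p and q differ sharply.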
Policy Correction: Adjusting the learning update to account for the difference between the exploration policy (probe) and the target policy to prevent bias
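A hedged per-token sketch of such a correction, assuming log-probabilities of the sampled action under both policies are available; the clipping threshold is an illustrative stability choice, not a value from the source:

```python
import math

# Scale the learning signal by the importance ratio
# pi_target(a|s) / pi_probe(a|s), clipped to bound the update when the two
# policies disagree strongly. This removes the bias introduced by sampling
# trajectories from the probe policy instead of the target policy.
def corrected_advantage(logp_target, logp_probe, advantage, clip=5.0):
    ratio = math.exp(logp_target - logp_probe)
    return min(ratio, clip) * advantage

# A token the probe sampled readily but the target policy finds unlikely
# (ratio e^-2) is down-weighted accordingly:
down_weighted = corrected_advantage(logp_target=-3.0, logp_probe=-1.0, advantage=1.0)
```

Without this correction, updates computed on probe-generated trajectories would push the target policy toward the probe's distribution rather than toward higher reward.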
Aleatoric Uncertainty: Uncertainty arising from inherent randomness in the data or task
Epistemic Uncertainty: Uncertainty arising from the model's lack of knowledge, which can be reduced with more data or reasoning
Probe Policy: A temporary auxiliary policy used to generate exploratory trajectories (often via prompting) to help the main policy escape local optima