RLVR: Reinforcement Learning with Verifiable Rewards—training models using correct/incorrect feedback on final answers rather than human preferences
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the average reward of a group of samples for the same prompt
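A minimal sketch of the group-relative advantage computation, assuming the common mean-subtraction with standard-deviation normalization (some GRPO variants omit the normalization); the function name is illustrative:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each sample's reward minus the group
    mean, normalized by the group's standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four rollouts for the same prompt, scored 1 (correct) / 0 (incorrect):
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Note that when every rollout in the group gets the same reward, all advantages are zero and the prompt contributes no gradient signal.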
segment rollout: A decoding strategy where generation is paused at fixed intervals (segments) to allow immediate training on completed samples, rather than waiting for the entire batch to finish
entropy collapse: A failure mode in RL where the model's output distribution becomes too deterministic (peaked), leading to a loss of diversity and exploration
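Collapse is typically monitored by tracking the policy's token-level entropy; a sketch of the quantity involved, with hypothetical distributions:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A spread-out (exploratory) distribution vs. a collapsed, near-deterministic one:
print(entropy([0.25, 0.25, 0.25, 0.25]))   # ≈ 1.386 (maximum for 4 outcomes)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ≈ 0.168
```

A steadily shrinking average entropy over training is the usual symptom of collapse.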
MPTs: Well-Mastered Positive Tokens—tokens in correct answers that the model already predicts with very high probability (e.g., >0.99)
importance sampling: A technique to estimate properties of a target distribution using samples from a different proposal distribution, corrected by a weight ratio
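A toy self-normalized importance-sampling estimate, using a hypothetical discrete target distribution and a uniform proposal:

```python
import random

random.seed(0)

# Target distribution p over {0, 1, 2, 3}; proposal q is uniform (prob 0.25).
p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
q = 0.25

# Draw from the proposal, then weight each sample by the ratio p(x)/q(x).
draws = [random.randrange(4) for _ in range(100_000)]
weights = [p[x] / q for x in draws]

# Self-normalized estimate of E_p[X]; the true value is 2.0.
estimate = sum(w * x for w, x in zip(weights, draws)) / sum(weights)
print(round(estimate, 3))  # close to 2.0
```

In the RL setting the ratio is the same idea applied per token: the current policy's probability of the sampled token divided by the generating (behavior) policy's probability.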
on-policy: Training on data generated by the current version of the model being optimized
off-policy: Training on data generated by an older version of the model
SAIS: Segment-Aware Importance Sampling—calculating importance weights separately for each segment based on which historical model version generated it
POIS: Pseudo On-Policy Importance Sampling—treating all segments of a rollout, including those generated by older model versions, as on-policy (weight = 1), thereby avoiding importance-ratio clipping
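The two schemes above differ only in how the per-token importance weight is formed; a minimal sketch under assumed per-segment log-probabilities (the numbers and function names are illustrative, not from the source):

```python
import math

# One rollout split into two segments; each segment may have been generated
# by an older policy version. Hypothetical log-probabilities:
# current[i][t]  = log-prob of token t of segment i under the current policy
# behavior[i][t] = log-prob under the policy version that generated segment i
current  = [[-0.2, -0.5], [-0.1, -0.3]]
behavior = [[-0.4, -0.5], [-0.1, -0.3]]  # last segment is freshly generated

def sais_weights(current, behavior):
    """SAIS: per-token ratio against whichever policy version generated
    that segment, so each segment gets its own importance correction."""
    return [[math.exp(c - b) for c, b in zip(cs, bs)]
            for cs, bs in zip(current, behavior)]

def pois_weights(current, behavior):
    """POIS: treat every segment as on-policy, i.e. weight 1 everywhere."""
    return [[1.0] * len(cs) for cs in current]
```

Under SAIS the freshly generated segment naturally gets weights of 1 (identical log-probs), while older segments get corrected ratios; POIS simply forces all weights to 1.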