GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines from the average reward of a group of rollouts for the same prompt, removing the need for a separate critic model
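The group-mean baseline can be sketched in a few lines. This is a minimal illustration, not the full GRPO objective: it assumes scalar rewards for a group of rollouts on one prompt, and normalizes by the group's reward standard deviation as GRPO does; the function name and `eps` constant are illustrative.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    # Baseline = mean reward of the group of rollouts for the same prompt,
    # so no separate critic model is needed; advantages are additionally
    # normalized by the group's reward standard deviation.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt: two correct (reward 1), two incorrect (reward 0).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct rollouts get positive advantage, incorrect ones negative, and the advantages sum to zero by construction.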
RLVR: Reinforcement Learning with Verifiable Rewards—RL where the reward is determined by a deterministic verifier (e.g., code execution or math answer check) rather than a learned reward model
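A verifiable reward is just a deterministic check. The sketch below assumes an exact-match answer verifier after light normalization; real verifiers (code execution, symbolic math checking) are more involved, and the function name is hypothetical.

```python
def verifiable_reward(model_answer: str, reference: str) -> float:
    # Deterministic verifier: reward 1.0 iff the model's final answer matches
    # the reference after normalization. No learned reward model is involved,
    # so the reward cannot be hacked by stylistic tricks.
    def normalize(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0
```

Because the verifier is a pure function of the output, the same answer always receives the same reward.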
ICL seeding: In-Context Learning seeding—injecting solved examples into the prompt context during training to help the model generate a correct response for hard problems
pass@1: The probability that a single generated solution is correct
cons@32: Self-consistency metric—whether the majority-vote answer across 32 sampled rollouts is correct
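Both metrics can be estimated directly from sampled rollouts. A minimal sketch, assuming string-valued final answers and a known reference; a small k is used here for brevity, whereas cons@32 uses k = 32.

```python
from collections import Counter

def pass_at_1(is_correct):
    # pass@1 estimated as the fraction of sampled solutions that are correct.
    return sum(is_correct) / len(is_correct)

def cons_at_k(answers, reference):
    # cons@k: take the majority-vote answer across the k rollouts,
    # then score that single answer against the reference.
    majority, _ = Counter(answers).most_common(1)[0]
    return 1.0 if majority == reference else 0.0

# Hypothetical rollout answers for one prompt (reference answer: "7").
answers = ["7", "7", "3", "7", "5"]
p1 = pass_at_1([a == "7" for a in answers])
c32 = cons_at_k(answers, "7")
```

Here pass@1 is 3/5 = 0.6, while the majority vote lands on the correct answer, so the consistency score is 1.0.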
Student's t-confidence interval: A statistical range used here to estimate the uncertainty of the mean reward for a specific prompt based on limited samples
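The per-prompt interval can be computed from the sampled rewards alone. A sketch under stated assumptions: binary rewards, a two-sided 95% interval, and a hard-coded critical value (2.365 is the t critical value for df = 7, i.e., 8 rollouts), since the standard library has no t-distribution quantiles.

```python
from math import sqrt
from statistics import mean, stdev

def t_confidence_interval(rewards, t_crit=2.365):
    # Student's t CI for the mean reward of one prompt from limited samples.
    # t_crit = 2.365 is the two-sided 95% value for df = 7 (8 rollouts);
    # replace it when using a different number of rollouts.
    n = len(rewards)
    m = mean(rewards)
    half_width = t_crit * stdev(rewards) / sqrt(n)
    return m - half_width, m + half_width

# Eight rollouts on one prompt, five of them correct.
lo, hi = t_confidence_interval([1, 0, 1, 1, 0, 1, 1, 0])
```

The interval is centered on the empirical mean reward (0.625 here) and widens when rollouts disagree, flagging prompts whose difficulty estimate is still uncertain.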
novelty: A measure of how unexpected a correct sequence is under the model's current distribution, computed from its length-normalized log-likelihood (lower per-token likelihood means higher novelty)
advantage sharpening: Modifying the computed advantage (learning signal) to give extra weight to specific high-value rollouts (here, novel correct answers)
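The two definitions above compose naturally: novelty identifies correct rollouts the current policy found surprising, and sharpening boosts their advantage. A minimal sketch assuming per-token log-probabilities are available; the `alpha` boost factor and novelty `threshold` are illustrative hyperparameters, not values from the source.

```python
def novelty(token_logprobs):
    # Length-normalized negative log-likelihood: higher means the sequence
    # was less expected under the model's current distribution.
    return -sum(token_logprobs) / len(token_logprobs)

def sharpen_advantages(advantages, novelties, correct, alpha=0.5, threshold=1.0):
    # Up-weight the advantage of rollouts that are both correct and novel;
    # all other advantages pass through unchanged.
    return [
        a * (1.0 + alpha) if c and nv > threshold else a
        for a, nv, c in zip(advantages, novelties, correct)
    ]
```

Only the first rollout below is both correct and above the novelty threshold, so only its advantage is amplified:

```python
sharpened = sharpen_advantages(
    advantages=[1.0, 1.0, -1.0],
    novelties=[2.0, 0.5, 3.0],
    correct=[True, True, False],
)
```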