RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using binary success/failure signals (e.g., correct math answer) rather than human preference models
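A minimal sketch of what a binary verifiable reward looks like in code. The function name and exact-match check are illustrative assumptions; real verifiers typically normalize answers more carefully (e.g., symbolic equivalence for math).

```python
def verifiable_reward(model_answer: str, gold_answer: str) -> float:
    """Binary RLVR signal: 1.0 if the final answer matches the ground
    truth, else 0.0 (no learned preference model involved)."""
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

r_pass = verifiable_reward(" 42 ", "42")  # match after whitespace stripping
r_fail = verifiable_reward("41", "42")
```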
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs for the same prompt, removing the need for a separate value network critic
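The group-relative normalization at the core of GRPO can be sketched as follows; this is a simplified illustration (the epsilon and the use of the group standard deviation follow common practice, not a specific implementation).

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: for rollouts sampled from the same
    prompt, subtract the group mean reward and divide by the group
    standard deviation, so no separate value-network critic is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Binary verifiable rewards for 4 rollouts of one prompt:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts get positive advantage and failed ones negative, with the magnitudes scaled by how surprising success was within the group.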
Item Response Theory (IRT): A psychometric framework for modeling the relationship between a test taker's latent ability and the difficulty of the items they attempt
SFT: Supervised Fine-Tuning—training on ground-truth data using standard cross-entropy loss
Hint Scaffolding: Providing a prefix of the ground-truth solution to the model to guide its generation and reduce exploration difficulty
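A minimal sketch of hint scaffolding, assuming a character-level prefix for simplicity (real pipelines would typically truncate at token boundaries; the function name and fraction parameter are illustrative).

```python
def scaffold_prompt(problem: str, gold_solution: str, hint_frac: float) -> str:
    """Prepend the first `hint_frac` of the ground-truth solution to the
    prompt, so the model only has to complete the remaining steps."""
    cut = int(len(gold_solution) * hint_frac)
    return problem + "\n" + gold_solution[:cut]

# Give the model half of the reference solution as a scaffold:
prompt = scaffold_prompt("Solve 2x+3=7.", "2x=4, so x=2.", 0.5)
```

Decreasing `hint_frac` toward zero recovers the unscaffolded prompt, which allows exploration difficulty to be annealed during training.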
Rollout: A complete sequence generated by the model starting from a prompt (and potentially a hint)
3PL: Three-Parameter Logistic model—an IRT model whose item characteristic curve is defined by discrimination, difficulty, and guessing (lower-asymptote) parameters
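The 3PL item characteristic curve can be written directly in code; this is the standard 3PL form, with parameter names chosen for readability.

```python
import math

def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3PL IRT curve: probability of a correct response given ability
    theta, discrimination a, difficulty b, and guessing floor c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# When ability equals difficulty, the probability sits midway
# between the guessing floor c and 1:
p = p_correct_3pl(theta=0.0, a=1.5, b=0.0, c=0.25)
```

The guessing parameter `c` sets the lower asymptote: even a very low-ability test taker succeeds with probability at least `c`, while probability approaches 1 as ability grows.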
CoT: Chain-of-Thought—intermediate reasoning steps generated by the model before the final answer