LRM: Large Reasoning Model—an LLM specifically optimized for complex reasoning tasks (math, code, logic) via RL.
RLVR: Reinforcement Learning with Verifiable Rewards—using objective, rule-based signals (e.g., unit tests, correct answers) instead of human preference models.
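A verifiable reward in this sense can be as simple as an exact-match check against a ground-truth answer; the sketch below (hypothetical function, not from any particular library) illustrates the idea of a rule-based signal with no learned preference model:

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Rule-based reward: 1.0 if the model's final answer matches the
    ground truth exactly (after trimming whitespace), else 0.0.
    No reward model is trained -- the signal is fully objective."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0
```

For code tasks, the same role is played by running the candidate program against a unit-test suite and rewarding pass/fail.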
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same prompt, eliminating the need for a separate value function (critic) model.
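The group-relative advantage at the heart of GRPO reduces to normalizing each sampled output's reward against the statistics of its own group; a minimal sketch (function name is illustrative):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Estimate advantages for a group of outputs sampled from the same
    prompt: subtract the group mean reward and divide by the group std.
    The group baseline replaces a learned value function (critic)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# With binary verifiable rewards for 4 samples of one prompt:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Correct outputs get positive advantage and incorrect ones negative, relative only to siblings from the same prompt.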
PPO: Proximal Policy Optimization—a standard RL algorithm that updates policies with a clipped objective to ensure stability.
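The clipped objective can be sketched per-token as follows (a simplified scalar version of the standard PPO surrogate, not a full trainer):

```python
import math

def ppo_clip_objective(logp_new: float, logp_old: float,
                       advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate objective for one action:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r = pi_new(a|s) / pi_old(a|s). Clipping caps how far a single
    update can push the policy away from the one that gathered the data."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage and a ratio of 2.0, the objective is capped at 1.2 (for eps=0.2), so the gradient incentive to keep increasing that action's probability vanishes.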
RLHF: Reinforcement Learning from Human Feedback—aligning models using a reward model trained on human preferences.
DPO: Direct Preference Optimization—optimizing the policy directly on preference data without an explicit reward model.
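The DPO loss on one preference pair is a logistic loss on the difference of policy-vs-reference log-ratios for the chosen (w) and rejected (l) responses; a minimal sketch with illustrative argument names:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                         - (log pi(y_l) - log pi_ref(y_l))]).
    beta controls how strongly the policy may deviate from the reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.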
CoT: Chain-of-Thought—intermediate reasoning steps generated by the model before the final answer.
GenRM: Generative Reward Model—an LLM-based reward model that produces textual critiques/reasoning rather than just a scalar score.
PRM: Process Reward Model—a reward model that evaluates the intermediate steps of a reasoning trace rather than only the final outcome.
Verifier's Law: The principle that tasks with robust automated verification (e.g., math, code) are easiest to improve via RL.