GRPO: Group Relative Policy Optimization—a reinforcement learning method that estimates baselines from the average score of a group of completions rather than using a separate critic model
PPO: Proximal Policy Optimization—a standard RL algorithm that uses a clipped surrogate objective to ensure stable policy updates
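The clipped surrogate objective above can be sketched in a few lines. This is a minimal illustration of the clipping idea for a single action, not a full PPO implementation; the function name and the default epsilon of 0.2 are illustrative choices, not from the source.

```python
def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO surrogate for one action: min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio is pi_new(a|s) / pi_old(a|s); clipping keeps the update from
    moving the policy too far from the old policy in a single step.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note that the `min` makes the objective pessimistic: a large ratio cannot inflate the objective beyond the clipped value, which is what stabilizes the update.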
Advantage: A measure of how much better a specific action (completion) is compared to the average baseline performance
Completion Pruning: The process of discarding generated responses that have low information value (low absolute advantage) before performing expensive gradient computations
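The three entries above (GRPO's group-relative baseline, the advantage, and completion pruning) fit together as one pipeline: score a group of completions, normalize each reward against the group mean, then drop completions whose advantage is near zero. A minimal sketch, assuming standardized advantages and a hypothetical pruning threshold of 0.1 (neither value is specified in the source):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: (reward - group mean) / group std.

    The group mean replaces the learned critic baseline used in PPO.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero when all rewards match
    return [(r - mean) / std for r in rewards]

def prune_completions(completions: list[str], advantages: list[float],
                      threshold: float = 0.1) -> list[tuple[str, float]]:
    """Keep only completions with |advantage| >= threshold, so gradient
    computation is spent on informative samples."""
    return [(c, a) for c, a in zip(completions, advantages) if abs(a) >= threshold]
```

A side effect worth noting: if every completion in a group gets the same reward, all advantages are zero and the whole group is pruned, since such a group carries no learning signal.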
Bucket Effect: A phenomenon in parallel computing where the overall speed is limited by the device processing the largest workload (the 'slowest' bucket)
Pass@1: The probability that a model generates a correct answer on its first attempt
Chain of Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
KL divergence: Kullback-Leibler divergence—a statistical distance measuring how one probability distribution differs from a reference distribution
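For discrete distributions, the divergence defined above can be computed directly from its formula, KL(P || Q) = sum_i p_i * log(p_i / q_i). A minimal sketch for finite probability vectors (the zero-probability handling via the `if pi > 0` guard follows the standard convention that 0 * log 0 = 0):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) for discrete distributions given as aligned probability lists.

    Asymmetric: KL(P || Q) != KL(Q || P) in general, and it is zero
    only when the two distributions are identical.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In RL fine-tuning, a KL term of this kind is typically used as a penalty that keeps the updated policy close to a reference model.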