GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines by averaging rewards across a group of outputs for the same input, avoiding a separate value network
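The group-relative baseline can be sketched in a few lines. This is a minimal illustration of the standard GRPO advantage computation (reward minus group mean, scaled by group standard deviation); the function name and the zero-variance guard are my own, not from the source.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one input: each output's reward is
    normalized against the mean and std of its group, replacing a learned
    value-network baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: all-equal rewards give std 0
    return [(r - mean) / std for r in rewards]

# Four sampled outputs for the same prompt, two correct (reward 1) and two not:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```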
S-GRPO: Serial-Group Decaying-Reward Policy Optimization—the proposed method that forms its group from serial early-exit completions of a single reasoning path, rather than from independently sampled parallel outputs
Chain-of-Thought (CoT): A prompting strategy where the model generates intermediate reasoning steps before the final answer
Overthinking: The tendency of reasoning models to generate redundant or unnecessary reasoning steps
Pass@k: An evaluation metric measuring the probability that at least one of k generated samples is correct
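The glossary entry does not fix an estimator; the widely used unbiased form (given n samples of which c are correct) can be sketched as:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn (without replacement) from n generations, c of them correct,
    is correct. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots: always a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # → 0.3
```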
Early Exit: Terminating the generation process before the maximum length is reached to save computation
Rollout: The process of generating a sequence of actions (tokens) from the policy
Decaying Reward: A reward function that decreases in value as the sequence length increases, penalizing longer sequences
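To make the idea concrete, here is one possible shape of such a reward; the exponential decay and the 0.5 factor are illustrative assumptions, not the paper's exact formula.

```python
def decaying_reward(correct, exit_index, alpha=0.5):
    """Illustrative decaying reward: a correct answer produced at an earlier
    exit earns more than the same answer produced later, penalizing length.
    The exponential form alpha**exit_index is an assumption for illustration."""
    if not correct:
        return 0.0  # incorrect answers earn nothing regardless of length
    return alpha ** exit_index

print(decaying_reward(True, 0))  # earliest correct exit → 1.0
print(decaying_reward(True, 2))  # later correct exit   → 0.25
```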
Policy Gradient: An optimization technique that updates the model parameters to maximize expected reward by ascending the gradient of the expected reward with respect to the policy parameters
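In practice this is implemented as a surrogate loss whose gradient matches the policy gradient; a minimal REINFORCE-style sketch (function name mine) is:

```python
def policy_gradient_loss(log_probs, advantages):
    """REINFORCE-style surrogate loss: -E[log pi(a|s) * A].
    Minimizing this with gradient descent ascends the expected reward,
    since d/dtheta E[R] = E[grad log pi * A]."""
    n = len(log_probs)
    return -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n

# Two tokens: the first has positive advantage, the second negative.
print(policy_gradient_loss([-1.0, -2.0], [1.0, -1.0]))  # → -0.5
```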