GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sequence's reward to the average reward of a group of samples for the same prompt, avoiding a critic model.
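The group-relative advantage described above can be sketched in a few lines. This is an illustrative implementation, not the paper's code; the function name and the divide-by-std normalization (a common GRPO convention) are assumptions:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt: each sampled sequence's
    reward minus the group mean, normalized by the group std.
    No learned critic is needed; the group itself provides the baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: uniform rewards give std=0
    return [(r - mean) / std for r in rewards]

# Four sampled completions for the same prompt, binary correctness rewards:
# correct answers get positive advantage, incorrect ones negative.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```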
GTPO: Group Token Policy Optimization—the proposed token-level algorithm that assigns each token its own entropy-weighted reward, rather than a single reward shared across the whole sequence.
GRPO-S: Sequence-Level Group Relative Policy Optimization—the proposed sequence-level variant that modulates the global reward for a sequence based on its average entropy.
DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization—a GRPO-style baseline that modifies the clipping range, sampling strategy, and reward normalization or loss aggregation.
CoT: Chain-of-Thought—a prompting strategy where models generate intermediate reasoning steps before the final answer.
Policy Entropy: A measure of the randomness or uncertainty in the model's next-token prediction distribution.
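Policy entropy as defined above is just the Shannon entropy of the next-token distribution. A minimal sketch (function name is illustrative; entropy is in nats):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution.
    Peaked distributions (confident predictions) give low entropy;
    a uniform distribution gives the maximum, log(vocab_size)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A fully confident prediction has zero entropy;
# a uniform distribution over 4 tokens has entropy log(4).
print(token_entropy([1.0, 0.0, 0.0, 0.0]))
print(token_entropy([0.25, 0.25, 0.25, 0.25]))
```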
Importance Sampling: A technique used in RL (specifically PPO/GRPO) to estimate properties of a target distribution while sampling from a different (older) distribution, using a ratio of probabilities.
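In PPO/GRPO the importance-sampling ratio is computed per token from the new and old policies' log-probabilities, then clipped to keep updates near the sampling distribution. A sketch under common defaults; the function name and the clip range of 0.2 are assumptions, not values from this paper:

```python
import math

def is_ratio(logp_new, logp_old, clip_eps=0.2):
    """PPO-style importance ratio pi_new(a|s) / pi_old(a|s), computed from
    log-probabilities and clipped to [1 - eps, 1 + eps] so that a single
    update cannot move the policy too far from the data-collecting policy."""
    ratio = math.exp(logp_new - logp_old)
    return max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))

# Unchanged policy -> ratio 1; large log-prob shifts are clipped.
print(is_ratio(0.0, 0.0))
print(is_ratio(1.0, 0.0))   # exp(1) ~ 2.72, clipped down
print(is_ratio(-5.0, 0.0))  # exp(-5) ~ 0.007, clipped up
```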