GSPO: Group Sequence Policy Optimization—the proposed RL algorithm that clips and updates policies based on sequence-level likelihood ratios
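A minimal sketch of the sequence-level ratio, assuming GSPO's length-normalized form: the ratio is the geometric mean of the per-token ratios, computed as exp of the mean token log-ratio, then clipped. Function name and the epsilon value are illustrative.

```python
import math

def gspo_sequence_ratio(logp_new, logp_old, eps=0.2):
    """Length-normalized sequence-level importance ratio with clipping.

    logp_new / logp_old: per-token log-probabilities of one sampled
    response under the current and the old policy. The ratio is
    exp(mean(logp_new - logp_old)), i.e. the geometric mean of the
    token-level ratios (an assumption based on GSPO's definition).
    """
    n = len(logp_new)
    log_ratio = sum(a - b for a, b in zip(logp_new, logp_old)) / n
    ratio = math.exp(log_ratio)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return ratio, clipped

# New policy slightly more confident on every token: ratio > 1, unclipped.
ratio, clipped = gspo_sequence_ratio([-1.0, -0.5, -0.8], [-1.2, -0.6, -1.0])
```

Because the exponent averages over tokens, a single outlier token moves the sequence ratio far less than it would move a token-level ratio.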
GRPO: Group Relative Policy Optimization—a baseline RL algorithm that normalizes rewards within a group of outputs and uses token-level clipping
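The group-relative normalization can be sketched as follows; every token of a response shares the response's single scalar advantage. The epsilon guard against zero variance is an illustrative detail.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: A_i = (r_i - mean(r)) / std(r),
    computed within one group of responses to the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, with scalar rewards.
advs = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

Normalizing within the group removes the need for a learned value baseline: only relative quality among the sampled responses matters.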
Importance Sampling: A technique to estimate properties of a target distribution using samples from a different (behavior) distribution by weighting them
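A toy sketch of the weighting: each sample drawn from the behavior distribution q is reweighted by p(x)/q(x), so the average is an unbiased estimate of the expectation under the target p. The discrete distributions here are hypothetical.

```python
import random

def importance_estimate(f, target_p, behavior_q, support, n=100_000, seed=0):
    """Estimate E_{x~p}[f(x)] from samples x~q, weighting by p(x)/q(x)."""
    rng = random.Random(seed)
    xs = rng.choices(support, weights=[behavior_q[x] for x in support], k=n)
    return sum(target_p[x] / behavior_q[x] * f(x) for x in xs) / n

p = {0: 0.2, 1: 0.3, 2: 0.5}        # target distribution
q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}  # behavior distribution (uniform)
est = importance_estimate(lambda x: x, p, q, support=[0, 1, 2])
# True value: E_p[x] = 0*0.2 + 1*0.3 + 2*0.5 = 1.3
```

The variance of the estimate grows as p and q diverge, which is why off-policy RL clips or otherwise constrains these ratios.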
MoE: Mixture-of-Experts—a model architecture that routes each input to a small subset of 'expert' sub-networks; fluctuations in this routing between policy versions are a common source of RL training instability

Routing Replay: A stabilization strategy for GRPO on MoEs that forces the new policy to reuse the specific experts activated by the old policy to reduce variance
Model Collapse: A failure mode in RL where the model's output quality degrades drastically and irreversibly during training
KL regularization: A penalty term that keeps the trained policy close to the reference policy, discouraging reward hacking and model collapse
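A common way to apply the penalty in RLHF-style training is to subtract a KL estimate from the scalar reward; this sketch uses the simple sum of per-token log-ratios as the estimator, and the beta value is an illustrative assumption.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.05):
    """Shape the reward with a sequence-level KL penalty:
    r' = r - beta * sum_t (log pi(t) - log pi_ref(t)).
    The sum of log-ratios is a standard sample-based KL estimate."""
    kl = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * kl

# Policy more confident than the reference on both tokens -> positive KL,
# so the shaped reward is slightly below the raw reward.
shaped = kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.0])
```

Larger beta ties the policy more tightly to the reference, trading off reward maximization against drift.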
Off-policy: Learning from data generated by an older version of the policy rather than the current one being optimized