PPO: Proximal Policy Optimization—an RL algorithm that uses clipped importance weights to prevent destructive policy updates
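To make the clipping mechanism concrete, here is a minimal sketch of PPO's clipped surrogate objective; the function name and the use of NumPy over log-probability arrays are illustrative, and `eps` stands in for PPO's clip parameter (commonly 0.2):

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    # Importance weight: ratio of new to old policy probabilities.
    ratio = np.exp(logp_new - logp_old)
    # Clipping the ratio to [1 - eps, 1 + eps] bounds how far a
    # single update can push the policy away from the old one.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # PPO takes the minimum of the unclipped and clipped surrogates,
    # so out-of-range ratios earn no extra gradient signal.
    return np.minimum(ratio * advantages, clipped * advantages).mean()
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage; a ratio outside the clip range contributes only the clipped value.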
GRPO: Group Relative Policy Optimization—an online RL algorithm that baselines advantages against a group of samples for the same prompt
RLOO: REINFORCE Leave-One-Out—an unbiased advantage estimator that baselines each sample against the mean reward of the other samples for the same prompt
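The two baseline schemes above can be sketched side by side; this is a simplified illustration (function names are my own), assuming the common GRPO variant that also normalizes by the group's reward standard deviation:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # GRPO: baseline each sample against the group mean reward,
    # then (in the common variant) normalize by the group std.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def rloo_advantages(rewards):
    # RLOO: baseline each sample against the mean of the OTHER
    # samples, so the baseline is independent of the current
    # sample and the estimator stays unbiased.
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    leave_one_out_mean = (r.sum() - r) / (n - 1)
    return r - leave_one_out_mean
```

Both estimators need several samples per prompt; the key design difference is that RLOO excludes the current sample from its own baseline, while GRPO's group mean includes it.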
Entropy Collapse: The phenomenon where a model's probability distribution narrows too quickly, losing the ability to generate diverse outputs
FSDP2: Fully Sharded Data Parallel (Version 2)—a PyTorch framework for distributed training that shards parameters, gradients, and optimizer states across workers
BF16: BFloat16—a 16-bit floating point format commonly used in ML that can suffer from precision issues in entropy calculations
GSPO: Group Sequence Policy Optimization—an algorithm using a trajectory-level trust region based on geometric average probability ratios
REPO: A proposed family of algorithms that modify the advantage function to regulate entropy (introduced in this paper)
ADAPO: Adaptive Asymmetric PPO—a proposed approach using adaptive clipping to maintain target entropy levels
pass@k: A metric measuring the probability that at least one of k generated solutions is correct
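The pass@k definition above is usually computed with the standard unbiased estimator: given n generated samples of which c are correct, the probability that a random size-k subset contains at least one correct sample is 1 − C(n−c, k)/C(n, k). A minimal implementation:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: number of correct samples among them
    k: subset size being evaluated
    """
    # If fewer than k samples are incorrect, every size-k subset
    # must contain at least one correct sample.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with 4 samples and 1 correct, pass@1 is 1 − C(3,1)/C(4,1) = 0.25, i.e. the fraction of correct samples, as expected for k = 1.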