GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing each sample's reward to the average reward of a group of samples generated for the same prompt.
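The group-relative advantage can be sketched in a few lines. This is an illustrative toy (the reward values are made up, and the mean/std normalization shown here is one common variant, not necessarily the exact formula used in any particular system):

```python
# Toy sketch: group-relative advantages for 4 samples of one prompt.
# Reward values are illustrative, not from a real run.
rewards = [0.9, 0.4, 0.7, 0.2]

mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5

# Each sample's advantage: how much better it did than its group peers,
# normalized by the group's reward spread (epsilon avoids division by zero).
advantages = [(r - mean) / (std + 1e-8) for r in rewards]
print(advantages)
```

Note that no learned value function is needed: the group itself serves as the baseline, so advantages within a group always sum to (approximately) zero.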
Off-policy RL: Training a reinforcement learning model using data generated by a previous version of the policy (stale data), rather than the current policy.
Importance Sampling: A technique to estimate properties of a target distribution using samples from a different distribution (behavior policy) by weighting samples based on the likelihood ratio.
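A minimal sketch of the idea, using made-up discrete toy distributions (the numbers are illustrative; in RL the ratio would be between token probabilities under the current and behavior policies):

```python
import random

# Estimate E_target[f(x)] using samples drawn from a *behavior*
# distribution, reweighting each sample by target(x) / behavior(x).
behavior = {0: 0.5, 1: 0.3, 2: 0.2}  # distribution we sampled from
target   = {0: 0.2, 1: 0.3, 2: 0.5}  # distribution we care about
f = lambda x: float(x * x)

random.seed(0)
xs = random.choices(list(behavior), weights=list(behavior.values()), k=100_000)

# Importance-weighted estimate: mean of f(x) * likelihood ratio.
estimate = sum(f(x) * target[x] / behavior[x] for x in xs) / len(xs)

true_value = sum(p * f(x) for x, p in target.items())  # exact expectation
print(estimate, true_value)
```

The estimate converges to the true expectation under the target distribution even though every sample came from the behavior distribution—this is exactly why stale rollouts can still provide a valid training signal.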
Policy Entropy: A measure of the randomness or exploration capability of the policy; high entropy means the model explores diverse outputs, while low entropy indicates over-exploitation.
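The contrast between an exploring and an exploiting policy can be made concrete with Shannon entropy over a small action distribution (the probabilities below are illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

exploring  = [0.25, 0.25, 0.25, 0.25]  # uniform over 4 actions: max entropy
exploiting = [0.97, 0.01, 0.01, 0.01]  # nearly deterministic: low entropy

print(entropy(exploring))   # log(4), the maximum for 4 outcomes
print(entropy(exploiting))  # much smaller
```

In RL training, a collapsing policy entropy is a common warning sign that the model has stopped exploring and is over-exploiting a narrow set of outputs.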
Data Staleness: The degree to which the data used for training lags behind the current policy parameters; higher staleness means the data comes from significantly older versions of the model.
PPO: Proximal Policy Optimization—a standard RL algorithm that uses clipping to prevent the new policy from deviating too far from the old policy.
Partial Rollout: An infrastructure optimization where long sequences are generated in segments; unfinished segments are stored and resumed later, creating off-policy data.
SFT: Supervised Fine-Tuning—training the model on high-quality labeled demonstrations before applying RL.