ScaleRL: The authors' proposed RL training recipe, combining PipelineRL, the CISPO loss, forced length interruptions (truncating overlong generations rather than penalizing them), and specific normalization techniques
CISPO: Clipped Importance Sampling Policy Optimization—a loss function combining truncated importance sampling with vanilla policy gradient
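This combination can be sketched in a few lines of plain Python. The function name, batch layout, and `eps_max` default below are illustrative, not from the source; in a real autograd framework the importance weight `rho` would be detached (stop-gradient) so that gradients flow only through the new log-probabilities.

```python
import math

def cispo_loss(logps_new, logps_old, advantages, eps_max=2.0):
    """Sketch of a CISPO-style loss: truncated IS weight times a
    vanilla policy-gradient (REINFORCE) term, averaged over tokens."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        rho = min(math.exp(lp_new - lp_old), eps_max)  # truncated IS ratio
        total += -rho * adv * lp_new                   # PG term, scaled by rho
    return total / len(logps_new)
```

Unlike PPO-style clipping, the clipped weight still multiplies the full gradient term, so tokens outside the trust region contribute a bounded (rather than zero) update.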
DAPO: An RL algorithm using asymmetric clipping to manage updates, often used to prevent entropy collapse
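The asymmetric ("clip-higher") surrogate can be sketched as follows; the epsilon values are the commonly cited DAPO defaults and should be treated as illustrative rather than authoritative here.

```python
def dapo_clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # PPO-style clipped surrogate, but with a looser upper bound than lower
    # bound, so low-probability tokens can still gain probability mass --
    # the mechanism credited with mitigating entropy collapse.
    clipped = max(min(ratio, 1.0 + eps_high), 1.0 - eps_low)
    return min(ratio * advantage, clipped * advantage)
```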
PipelineRL: An asynchronous RL setup where generators stream data continuously while trainers update weights, reducing GPU idle time
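The producer/consumer structure can be sketched with a queue and a shared weight version; everything below (names, queue size, batch count) is a toy illustration of the asynchrony, not the paper's implementation.

```python
import queue
import threading

def pipeline_rl_sketch(num_batches=5):
    # Generators stream rollouts into a bounded queue while the trainer
    # consumes them and bumps a shared weight version in flight; generation
    # never stalls waiting for a full synchronous batch (less GPU idle time).
    rollouts = queue.Queue(maxsize=4)
    weights = {"version": 0}

    def generator():
        for i in range(num_batches):
            # Each rollout records the weight version it was sampled under
            rollouts.put(("rollout", i, weights["version"]))

    t = threading.Thread(target=generator)
    t.start()
    consumed = []
    for _ in range(num_batches):
        consumed.append(rollouts.get())  # trainer consumes as data arrives
        weights["version"] += 1          # in-flight weight update
    t.join()
    return weights["version"], consumed
```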
GRPO: Group Relative Policy Optimization—a baseline RL method that normalizes rewards within a group of generations for the same prompt
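The group normalization can be sketched as below; whether the baseline divides by the population or sample standard deviation, and the epsilon guard, are assumptions of this sketch.

```python
import statistics

def grpo_advantages(rewards):
    # Advantage for each of the G responses to one prompt:
    # (reward - group mean) / group std, so the group is its own baseline.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std; eps avoids /0
    eps = 1e-8
    return [(r - mu) / (sigma + eps) for r in rewards]
```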
Sigmoidal Scaling: Modeling performance as R(C) = A / (1 + (C_mid / C)^B), where A is the asymptotic performance, B is the scaling exponent, and C_mid is the compute at which R reaches half the asymptote
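A quick numeric check of the curve (parameter values below are illustrative, not fitted values from the paper):

```python
def sigmoidal_fit(C, A, B, C_mid):
    # R(C) = A / (1 + (C_mid / C)**B): a saturating compute-performance
    # curve that approaches the asymptote A as compute C grows.
    return A / (1.0 + (C_mid / C) ** B)
```

At C = C_mid the ratio term equals 1, so R is exactly A/2; for C much larger than C_mid, R approaches A.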
FP32: 32-bit floating point precision, found to be critical for logit computation to prevent numerical instability
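One way to see why reduced precision at the logits is risky: bfloat16 keeps only 8 mantissa bits, so nearby logits collapse to the same value. The truncation below only approximates bf16 (real hardware rounds to nearest rather than truncating), but it illustrates the loss of resolution.

```python
import struct

def truncate_to_bf16(x):
    # Approximate bfloat16 by zeroing the low 16 bits of a float32:
    # same exponent range, only 8 mantissa bits of resolution remain.
    bits = struct.unpack("<I", struct.pack("<f", float(x)))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]
```

Two logits that differ at the third decimal place, e.g. 10.0 and 10.001, become indistinguishable after truncation, which distorts the softmax and the importance ratios computed from it.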
Zero-variance filtering: Removing prompts from the loss calculation where all generated responses yield the same reward (zero advantage)
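A sketch of the filter, assuming rewards are grouped per prompt (the data layout here is illustrative):

```python
def filter_zero_variance(prompt_groups):
    # Keep only prompts whose sampled responses received differing rewards;
    # if all rewards are identical, every advantage is zero and the prompt
    # contributes no learning signal, only wasted compute.
    return {prompt: rewards for prompt, rewards in prompt_groups.items()
            if len(set(rewards)) > 1}
```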
No-Positive-Resampling: A curriculum strategy where prompts are permanently removed from training once the model achieves a high pass rate (>= 0.9)
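The pruning step can be sketched as follows (function and variable names are illustrative; the 0.9 threshold is from the source):

```python
def prune_solved_prompts(pool, pass_rates, threshold=0.9):
    # Permanently drop prompts the model already passes >= 90% of the time:
    # near-solved prompts yield mostly zero-advantage groups, so resampling
    # them spends generation compute for little gradient signal.
    return [p for p in pool if pass_rates.get(p, 0.0) < threshold]
```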