PPO: Proximal Policy Optimization—an RL algorithm that updates policies using a clipped objective to ensure small, stable updates
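The clipped objective can be sketched in a few lines. This is a minimal illustration, not a full PPO implementation: the function name, the list-of-floats inputs, and the default `clip_eps=0.2` are assumptions for the example.

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss (averaged over tokens, to be minimized).

    logp_new:   log-probs of the taken actions under the current policy
    logp_old:   log-probs under the policy that generated the data
    advantages: advantage estimates for each action
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio pi_new / pi_old
        unclipped = ratio * adv
        # Clamp the ratio to [1 - eps, 1 + eps] so one update cannot move
        # the policy far from the data-generating policy.
        clipped = max(1 - clip_eps, min(1 + clip_eps, ratio)) * adv
        total += min(unclipped, clipped)   # pessimistic of the two objectives
    return -total / len(advantages)
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the vanilla policy-gradient surrogate; the clamp only activates once an update starts to drift.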
SFT: Supervised Fine-Tuning—initial training of a model on labeled data before RL
KV cache: Key-Value cache—stored intermediate computations in Transformers to speed up token generation
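A toy sketch of the caching idea, with scalar keys and values instead of real attention heads (the class and method names are invented for this example): each generated token appends its own key/value once, so later steps attend over the stored history instead of re-encoding the whole prefix.

```python
import math

class KVCache:
    """Toy KV cache: past keys/values are stored so each decoding step
    only computes the new token's K/V, not the entire prefix's."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Cache this token's key/value, then attend over the full history.
        self.keys.append(k)
        self.values.append(v)
        scores = [q * ki for ki in self.keys]     # dot products (scalars here)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(exps)
        return sum(w / z * vi for w, vi in zip(exps, self.values))
```

Without the cache, step *t* would recompute keys and values for all *t* prefix tokens; with it, each step does O(t) attention over stored entries but only O(1) new K/V computation.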
staleness: The gap between the model version whose parameters generated the training data and the model version currently being trained
behavior policy: The policy version actually used to generate the rollout data
proximal policy: A recent policy version used as a reference point in the PPO loss to prevent the model from drifting too far
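One way the last three terms fit together is a decoupled PPO-style objective, sketched below under assumed notation: the clipping ratio is taken against the proximal policy, while a separate importance weight corrects for data produced by a stale behavior policy. The function name and argument layout are illustrative, not from the source.

```python
import math

def decoupled_ppo_term(logp_new, logp_prox, logp_behavior, adv, clip_eps=0.2):
    """One per-token term of a decoupled PPO-style objective (a sketch).

    logp_new:      log-prob under the policy being trained
    logp_prox:     log-prob under the proximal (reference) policy
    logp_behavior: log-prob under the behavior policy that generated the rollout
    """
    clip_ratio = math.exp(logp_new - logp_prox)   # trust-region ratio
    iw = math.exp(logp_prox - logp_behavior)      # correction for staleness
    unclipped = clip_ratio * adv
    clipped = max(1 - clip_eps, min(1 + clip_eps, clip_ratio)) * adv
    return iw * min(unclipped, clipped)
```

When training is fully synchronous the behavior and proximal policies coincide, the importance weight is 1, and this reduces to the standard clipped PPO term.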
rollout: The process of the model generating text (reasoning traces) based on a prompt
straggler: A task (in this case, a long generation sequence) that takes much longer than others, forcing the whole batch to wait