RLP: Reinforcement Learning Pre-training—the proposed method, which rewards generated thoughts that improve next-token prediction during pre-training
CoT: Chain-of-Thought—intermediate reasoning steps generated by the model to help solve a problem
EMA: Exponential Moving Average—a technique where a slowly updated average of the model's weights serves as a stable reference (teacher) model
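A minimal sketch of an EMA teacher update over a flat list of weights; the function name `ema_update` and the decay value 0.999 are illustrative assumptions, not taken from the paper:

```python
def ema_update(teacher, student, decay=0.999):
    # Move each teacher weight a tiny step toward the current student weight;
    # high decay keeps the teacher slow-moving and stable.
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

# The teacher drifts only 0.1% toward the student per update:
ema_update([1.0], [0.0])  # [0.999]
```

In practice this runs once per optimizer step, so the teacher lags the student by many steps' worth of smoothing.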
NTP: Next-Token Prediction—the standard objective function for training language models
SFT: Supervised Fine-Tuning—training on labeled input-output pairs (e.g., instruction following)
RLVR: Reinforcement Learning with Verifiable Rewards—using an external checker (such as a code compiler or math solver) to provide feedback
clipped surrogate: The loss function used in PPO (Proximal Policy Optimization) that prevents the policy from moving too far from its previous version in a single update step
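A per-token sketch of the clipped surrogate objective, assuming the standard PPO form with clip range eps = 0.2 (the function name `clipped_surrogate` is illustrative):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    # PPO objective term: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    # where r is the new/old policy probability ratio and A the advantage.
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, the payoff is capped once the ratio
# leaves [0.8, 1.2], so one update cannot push the policy too far:
clipped_surrogate(2.0, 1.0)  # 1.2, not 2.0
```

Taking the min makes the clip one-sided in the pessimistic direction: it limits how much credit the policy can claim, but never hides a penalty.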
information gain: The difference in log-likelihood of the correct token between the reasoning model and the no-thought baseline
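A simplified per-token sketch of this quantity, taking the two probabilities of the correct next token as inputs; in RLP the no-thought baseline comes from the EMA teacher, which this sketch abstracts away:

```python
import math

def information_gain(p_with_thought, p_no_thought):
    # log p(token | context, thought) - log p(token | context):
    # positive when the thought makes the correct next token more likely.
    return math.log(p_with_thought) - math.log(p_no_thought)

# A thought raising the correct token's probability from 0.1 to 0.4
# earns a positive reward; an unhelpful thought earns ~0:
information_gain(0.4, 0.1)  # ≈ 1.386 (log 4)
```

Because the reward is a log-ratio, it is dense (defined at every token) and needs no external verifier.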
teacher forcing: Training technique where the model is conditioned on the ground-truth tokens as history, rather than on its own previous predictions
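A toy sketch of a teacher-forced loss, assuming a scoring function `logprob_fn(prefix, token)` that returns the model's log-probability of `token` given `prefix` (both the helper name and this interface are hypothetical):

```python
import math

def nll_teacher_forcing(logprob_fn, tokens):
    # At each step t the model is conditioned on the ground-truth prefix
    # tokens[:t], never on its own earlier samples.
    return -sum(logprob_fn(tokens[:t], tokens[t]) for t in range(1, len(tokens)))

# Stand-in model assigning probability 0.5 to every next token:
uniform = lambda prefix, token: math.log(0.5)
nll_teacher_forcing(uniform, [7, 3, 9])  # 2 * log 2 ≈ 1.386
```

This is exactly the standard next-token-prediction (NTP) objective; "teacher forcing" names the choice of conditioning on ground truth rather than on sampled history.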