RLHF: Reinforcement Learning from Human Feedback—alignment training using reward models trained on human preferences
RLVR: Reinforcement Learning with Verifiable Rewards—RL training using ground-truth verifiers (e.g., code execution, math answer checking) rather than human preference models
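To make the contrast with preference models concrete, a verifiable reward can be as simple as programmatic answer checking. A minimal sketch (the function name and matching rules are illustrative assumptions, not the paper's implementation):

```python
def math_reward(model_answer: str, ground_truth: str) -> float:
    """Binary verifiable reward: check the model's final answer against
    ground truth, with no learned preference model involved.
    Hypothetical sketch -- real verifiers normalize answers more carefully."""
    try:
        # Numeric answers: compare as floats so "42" matches "42.0".
        return 1.0 if float(model_answer.strip()) == float(ground_truth.strip()) else 0.0
    except ValueError:
        # Non-numeric answers: fall back to a whitespace-normalized string match.
        return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0
```

A code-domain analogue would replace the comparison with executing the generated program against unit tests.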
Thinking Mode: A generation mode where the model produces a long reasoning chain (internal monologue) before the final answer, usually improving performance on complex tasks
Catastrophic Forgetting: A phenomenon where a model loses previously learned knowledge or skills when trained on new data/domains
SFT: Supervised Fine-Tuning—training the model on curated prompt-response pairs to establish baseline behavior
IOI: International Olympiad in Informatics—a prestigious competitive programming competition used as a high-difficulty benchmark
Pass@1: A metric measuring the percentage of problems where the model's first generated solution is correct
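In practice Pass@1 is often estimated by sampling several solutions per problem and averaging the per-problem fraction that is correct (with a single sample this reduces to the definition above). A minimal sketch of that estimator:

```python
def pass_at_1(results: list[list[bool]]) -> float:
    """Estimate Pass@1 from sampled solutions.
    results[i] holds the correctness of each sampled solution for problem i;
    Pass@1 is the mean, over problems, of the fraction of correct samples."""
    per_problem = [sum(r) / len(r) for r in results]
    return sum(per_problem) / len(per_problem)
```

For example, with one problem solved by 1 of 2 samples and another solved by 2 of 2, the estimate is (0.5 + 1.0) / 2 = 0.75.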
Cascade RL: The authors' proposed method of performing RL sequentially across domains (e.g., Math then Code) rather than jointly
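The sequential structure can be sketched as a simple loop over per-domain RL stages, each starting from the previous stage's checkpoint. This is an illustrative skeleton only; `train_rl` and the stage tuples are hypothetical placeholders, not the authors' training code:

```python
def cascade_rl(model, stages, train_rl):
    """Run RL stages one after another (e.g., Math, then Code).
    Each stage resumes from the checkpoint produced by the previous stage,
    rather than mixing all domains into a single joint RL run."""
    for domain, dataset in stages:
        model = train_rl(model, domain, dataset)  # one RL stage per domain
    return model
```

Joint RL would instead interleave samples from all datasets in one run; the cascade keeps each stage's reward and data homogeneous.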