RLVR: Reinforcement Learning with Verifiable Rewards—using outcome-based feedback (e.g., code compiles and runs) rather than human labels
Abduction: Reasoning mode—inferring a plausible input given a program and its output (trial-and-error search)
Deduction: Reasoning mode—predicting the output given a program and an input (step-by-step execution)
Induction: Reasoning mode—synthesizing a program given a set of input-output examples (generalization)
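The three reasoning modes above can be illustrated with a toy Python program (the program and values here are hypothetical, chosen only to make the three questions concrete):

```python
# A toy program used to illustrate the three reasoning modes.
# (Illustrative example only; program and values are not from the paper.)

def f(x):
    return x * 2 + 1

# Deduction: given the program f and the input 3, predict the output
# by step-by-step execution.
output = f(3)  # -> 7

# Abduction: given the program f and the output 7, infer a plausible
# input by trial-and-error search over candidates.
candidate = next(x for x in range(-100, 100) if f(x) == 7)  # -> 3

# Induction: given input-output examples [(0, 1), (3, 7)], synthesize a
# program consistent with them (generalization), e.g.:
g = lambda x: 2 * x + 1
assert all(g(i) == o for i, o in [(0, 1), (3, 7)])
```

The same snippet thus yields three distinct tasks, depending on which of program, input, and output is held out.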
SFT: Supervised Fine-Tuning—training on labeled examples (not used here for the 'zero' paradigm)
TRR++: Task-Relative REINFORCE++—a newly proposed advantage estimator that normalizes baselines per task-role configuration
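A minimal sketch of the task-relative idea, assuming a running reward buffer per (task type, role) configuration; the class and method names below are mine, not the paper's, and the exact estimator may differ:

```python
from collections import defaultdict

class TaskRelativeBaseline:
    """Illustrative sketch: keep separate reward statistics for each
    (task_type, role) configuration and normalize each reward against
    its own configuration's mean and standard deviation."""

    def __init__(self, eps=1e-8):
        self.rewards = defaultdict(list)  # (task_type, role) -> rewards seen
        self.eps = eps

    def advantage(self, task_type, role, reward):
        key = (task_type, role)
        self.rewards[key].append(reward)
        rs = self.rewards[key]
        mean = sum(rs) / len(rs)
        var = sum((r - mean) ** 2 for r in rs) / len(rs)
        # Normalized advantage relative to this configuration only.
        return (reward - mean) / (var ** 0.5 + self.eps)
```

With three task types (abduction, deduction, induction) and two roles (proposer, solver), this maintains six independent baselines, so a high-variance role cannot distort another configuration's advantages.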
Learnability Reward: A reward signal for the Proposer that peaks when the Solver has a moderate success rate (neither 0% nor 100%), encouraging an appropriate curriculum
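One possible shape for such a reward, sketched with an illustrative peaked function, 4p(1 - p); this is an assumption for illustration, not the paper's exact formula:

```python
def learnability_reward(solver_success_rate: float) -> float:
    """Illustrative proposer reward (hypothetical shape, not the paper's
    formula): zero when the solver always fails or always succeeds, and
    maximal at a 50% success rate, steering the proposer toward tasks of
    intermediate difficulty."""
    p = solver_success_rate
    if p <= 0.0 or p >= 1.0:
        return 0.0          # nothing to learn from trivial or impossible tasks
    return 4.0 * p * (1.0 - p)  # peaks at p = 0.5
```

Any function that vanishes at the endpoints and peaks in between would serve the same curriculum-shaping purpose.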
Uh-oh moment: A safety observation in which the model produces concerning chains of thought during self-play