Knights and Knaves (K&K): A class of logic puzzles where characters are either truth-tellers (Knights) or liars (Knaves), used here as a controllable synthetic dataset
REINFORCE++: A critic-free variant of the REINFORCE algorithm used for RL fine-tuning that adds PPO-style stability techniques such as clipping, a token-level KL penalty, and batch-level normalization
GRPO: Group Relative Policy Optimization—a critic-free RL algorithm that samples a group of responses per prompt and uses the group's mean reward as the baseline, reducing variance without a learned value function
PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to ensure stability
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs
KL divergence: A measure of how much the trained model's probability distribution deviates from the reference model, often used as a penalty to keep the fine-tuned policy close to the reference and guard against reward hacking or degeneration
AIME: American Invitational Mathematics Examination—a challenging high-school math competition
AMC: American Mathematics Competitions—standardized math competitions for middle/high school students
OOD: Out-of-Distribution—tasks or data that differ significantly from the training data statistics
Process Reward Model (PRM): A reward model that provides feedback on intermediate steps of reasoning, not just the final answer
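To make the K&K entry concrete, a puzzle can be solved by brute force over truth assignments: each character is a Knight (True) or Knave (False), and a consistent assignment is one where every character's statement matches their type. The puzzle below is illustrative, not taken from the source's dataset.

```python
from itertools import product

def solve(statements):
    """Return all Knight/Knave assignments consistent with the statements.

    statements: one function per character, mapping an assignment tuple
    to the truth value of that character's claim under the assignment.
    """
    solutions = []
    for assignment in product([True, False], repeat=len(statements)):
        # A Knight's statement must be true; a Knave's must be false.
        if all(assignment[i] == stmt(assignment)
               for i, stmt in enumerate(statements)):
            solutions.append(assignment)
    return solutions

# Example puzzle: A says "we are both knaves"; B says "A is a knave".
statements = [
    lambda a: (not a[0]) and (not a[1]),  # A's claim
    lambda a: not a[0],                   # B's claim
]
# Unique solution: A is a Knave, B is a Knight.
```

Because the character count and statement depth are free parameters, this construction is what makes K&K a controllable synthetic benchmark: difficulty can be dialed up by adding characters or nesting claims.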
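The group-average baseline in the GRPO entry can be sketched in a few lines: sample several responses for one prompt, score them, then center (and typically scale) each reward by the group's statistics. Details such as the std normalization vary by implementation; this is a sketch of the common convention.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled responses.

    Subtracts the group mean reward (the baseline) and divides by the
    group std, so no learned value function is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```

With binary correctness rewards, e.g. `grpo_advantages([1.0, 0.0, 1.0, 0.0])`, correct responses get positive advantages and incorrect ones negative, purely from within-group comparison.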
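The "constrains policy updates" phrase in the PPO entry refers to its clipped surrogate objective: the probability ratio between the new and old policy is clipped to a small interval, so a single update cannot move the policy too far. A minimal per-token sketch (not the source's implementation):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate loss for one token/action.

    ratio: pi_new(a|s) / pi_old(a|s)
    Takes min(ratio * A, clip(ratio, 1-eps, 1+eps) * A),
    negated so gradient descent maximizes the objective.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return -min(ratio * advantage, clipped * advantage)
```

For a positive advantage, ratios above `1 + eps` yield no extra objective gain; for a negative advantage, ratios below `1 - eps` are likewise capped, which is what keeps updates stable.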
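The KL penalty described above is usually computed per token from the two models' log-probabilities of the sampled token. One widely used form is the non-negative "k3" estimator (due to Schulman's KL-approximation note); whether the source uses this exact estimator is an assumption.

```python
import math

def kl_penalty(logprob_policy, logprob_ref):
    """Per-token KL estimate between policy and reference model.

    Uses the k3 estimator r - 1 - log(r) with r = p_ref / p_policy,
    computed from log-probs of the sampled token. Always >= 0.
    """
    log_r = logprob_ref - logprob_policy
    return math.exp(log_r) - 1.0 - log_r
```

The per-token penalty is typically scaled by a coefficient and subtracted from the reward, pulling the policy back toward the reference model.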
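Finally, the PRM entry implies a design question the definition leaves open: how per-step scores become a single trajectory reward. A hypothetical aggregation helper (the `min`/`mean` choices are common in the literature, not something stated in the source):

```python
def prm_score(step_scores, aggregate="min"):
    """Combine a PRM's per-step scores into one trajectory reward.

    aggregate="min" rewards the weakest reasoning step (a chain is only
    as good as its worst link); "mean" credits partial progress.
    Illustrative only -- not an API from the source.
    """
    if aggregate == "min":
        return min(step_scores)
    return sum(step_scores) / len(step_scores)
```

Under `min` aggregation, a single bad intermediate step (e.g. scores `[0.9, 0.2, 0.8]`) dominates the reward, which is exactly the step-level signal an outcome-only reward cannot provide.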