SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, specific dataset to adapt its behavior
CoT: Chain-of-Thought—a prompting or training method where the model generates intermediate reasoning steps before the final answer
R1-style: Reasoning traces featuring extended chain-of-thought with self-reflection mechanisms, characterized by substantial length and explicit verification steps
AIME24: American Invitational Mathematics Examination 2024—a high-difficulty math competition dataset used as the primary testbed
avg@n: Average pass rate at n—generate n solutions per problem (temperature=1), score each as a binary success, and average the n outcomes
cov@n: Coverage at n—a binary indicator of whether the model solves the problem in at least one of its n attempts; averaged over a dataset, it gives the fraction of problems solved at least once
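The two metrics above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code; the `outcomes` structure (a list of per-problem binary success lists, one entry per sampled solution) and the function names are assumptions for the example.

```python
def avg_at_n(outcomes):
    # avg@n: per-problem mean pass rate over the n samples, averaged over problems
    return sum(sum(runs) / len(runs) for runs in outcomes) / len(outcomes)

def cov_at_n(outcomes):
    # cov@n: fraction of problems solved in at least one of the n attempts
    return sum(1 if any(runs) else 0 for runs in outcomes) / len(outcomes)

# Example: 3 problems, n = 4 samples each (1 = correct, 0 = incorrect)
outcomes = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 0]]
print(avg_at_n(outcomes))  # (0.5 + 0.0 + 0.25) / 3 = 0.25
print(cov_at_n(outcomes))  # 2 of 3 problems solved at least once ≈ 0.667
```

Note that cov@n upper-bounds avg@n: a problem solved even once counts fully toward coverage but only fractionally toward the average pass rate.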
OpenR1-Math-220k: A dataset of math problems paired with reasoning traces generated by DeepSeek-R1
Exh: Extremely Hard—the highest difficulty tier of AIME24 problems identified in this paper, on which SFT models typically achieve 0% accuracy