Long CoT: Long Chain-of-Thought—extended reasoning traces that include explicit steps for reflection, backtracking, and self-validation (e.g., 'Wait, let me check that')
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning method that freezes the pre-trained weights and injects trainable rank-decomposition matrices into the model's layers (see the sketch after this list)
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs; a minimal loss sketch also follows this list
LRM: Large Reasoning Model—a model optimized specifically for complex, multi-step reasoning tasks (e.g., OpenAI o1, DeepSeek-R1)
DeepSeek-R1: A strong open-source reasoning model used as a 'teacher' to generate training data in this paper
QwQ: A Qwen-based reasoning model, also used as a teacher for distillation
AIME: American Invitational Mathematics Examination—a challenging high-school math competition benchmark
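To make the LoRA entry concrete, here is a minimal PyTorch sketch of the idea, not the paper's implementation: the class name `LoRALinear` and the hyperparameters `r` and `alpha` are illustrative defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Rank-decomposition matrices: A projects down to rank r, B projects back up.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Only the low-rank matrices receive gradients.
layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16)
out = layer(torch.randn(2, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # 2 * 8 * 512 = 8192, vs. 262144 in the frozen base
```

Because B is initialized to zero, the adapted layer starts out identical to the frozen base model, and training only has to learn the low-rank correction.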
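Similarly, a minimal sketch of the SFT objective using Hugging Face transformers. Here 'gpt2' stands in for the actual student model and the prompt/answer pair is a hypothetical stand-in for teacher-distilled data; -100 is the library's standard ignore index, so the cross-entropy loss is computed only on the output tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical labeled pair; in practice these come from the distilled dataset.
prompt, answer = "Q: What is 2 + 2?\nA:", " 4"

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tok(prompt, return_tensors="pt").input_ids
full_ids = tok(prompt + answer, return_tensors="pt").input_ids

# Copy the inputs as labels, then mask the prompt tokens with -100 so the
# loss is taken only over the target (answer) tokens.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss  # next-token cross-entropy
loss.backward()  # an optimizer step would complete one SFT update
```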