D-CoT: Disciplined Chain-of-Thought—the proposed framework using control tags to structure reasoning
ORPO: Odds Ratio Preference Optimization—an alignment method that integrates preference learning directly into the supervised fine-tuning loss, used here to favor disciplined reasoning
SLM: Small Language Model—typically a model with fewer than ~10B parameters (here Qwen3-8B)
Overthinking: A failure mode where models generate excessive, circular, or drifting reasoning steps that degrade performance
Control Tags: Special tokens used to signal the intended reasoning mode—<TEMP_LOW> for fact-checking, <TEMP_MID> for convergence, <TEMP_HIGH> for exploration
Internalization: The phenomenon where the model learns the structured reasoning patterns and performs well even without explicit control tags during inference
Pareto frontier: The set of optimal trade-offs; here, D-CoT improves accuracy and efficiency (fewer reasoning tokens) simultaneously rather than trading one for the other
GPQA-diamond: A challenging benchmark dataset consisting of expert-level science questions
Qwen3: The family of language models used in the paper (Qwen3-8B student, Qwen3-235B teacher)
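To make the ORPO entry above concrete, here is a minimal sketch of the odds-ratio loss term as defined in the original ORPO work (Hong et al., 2024): the SFT negative log-likelihood is augmented with a penalty on the log-odds ratio between the preferred and rejected responses. The λ weight and the sequence log-probabilities are illustrative placeholders, not values from this paper.

```python
import math

def orpo_loss(logp_chosen: float, logp_rejected: float,
              sft_nll: float, lam: float = 0.1) -> float:
    """Sketch of the ORPO objective: L = L_SFT + lam * L_OR.

    logp_chosen / logp_rejected are (length-normalized) sequence
    log-probabilities of the preferred and dispreferred responses.
    """
    def log_odds(logp: float) -> float:
        # odds(y|x) = p / (1 - p), computed from the log-probability
        p = math.exp(logp)
        return math.log(p / (1.0 - p))

    # Log-odds ratio between preferred and rejected responses
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)

    # L_OR = -log sigmoid(ratio): small when the model already
    # favors the preferred (here, disciplined) response
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))

    return sft_nll + lam * l_or
```

In the D-CoT setting, the "chosen" response would be a disciplined, tag-structured reasoning trace and the "rejected" one an overthinking trace, so the odds-ratio term steers fine-tuning toward concise reasoning within the same SFT pass.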