DPO: Direct Preference Optimization—an alignment method that directly optimizes a policy to prefer 'winning' responses over 'losing' ones, without training a separate reward model.
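The pairwise DPO objective can be sketched for a single preference pair; the scalar log-probabilities and the function name below are illustrative assumptions, not the paper's implementation:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    Pushes the policy to rank the winning response above the losing
    one relative to a frozen reference model, acting as an implicit
    reward so no separate reward model is trained.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): small when the policy already prefers the winner
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree (zero margin), the loss is log 2; it shrinks as the policy increases the winner's log-probability relative to the loser's.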
SFT: Supervised Fine-Tuning—training a model on high-quality input-output pairs (e.g., questions and correct solutions) before applying alignment techniques.
SCDPO: Step-Controlled DPO—the proposed method that generates negative samples by branching from a specific step in a correct solution, providing fine-grained, step-level supervision.
GSM8K: Grade School Math 8K—a dataset of 8.5K high-quality, linguistically diverse grade school math word problems.
MATH: A dataset of 12.5K challenging competition mathematics problems.
temperature: A sampling hyperparameter that divides the logits before the softmax, controlling the randomness of the model's output; higher values flatten the distribution, increasing diversity and the likelihood of errors.
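The effect of temperature on sampling can be illustrated with a minimal softmax sketch (the function name and toy logits are assumptions for illustration):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide logits by the temperature before the softmax:
    # higher temperature flattens the distribution (more diverse,
    # more error-prone samples); lower temperature sharpens it
    # toward the highest-logit token.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

For the same logits, a higher temperature assigns less probability mass to the top token, which is why high-temperature sampling yields more diverse solutions that more often contain errors.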
Code-Integrated Solution: A solution format where reasoning steps alternate between natural language and executable code (typically Python).
Chain-of-Thought: A prompting strategy in which the model generates intermediate reasoning steps in natural language before producing the final answer.