MATH-P-Simple: A benchmark subset where problems are modified (e.g., changing numbers) but the underlying reasoning logic remains identical to the original problem
MATH-P-Hard: A benchmark subset where problems are modified so that the original solution path no longer applies, requiring new reasoning strategies
CoT: Chain-of-Thought, a prompting method in which the model generates intermediate reasoning steps before the final answer
ICL: In-Context Learning, the practice of providing examples (demonstrations) in the prompt to guide the model's behavior on the current task
Simple perturbation: Modifying non-critical parameters (like numerical values) that do not alter the fundamental reasoning pattern
Hard perturbation: Fundamental modifications to problem formulation that render the original solution method inapplicable
Edit distance: A metric measuring how much the perturbed problem string differs textually from the original; a smaller distance means the two strings are more similar
Mode collapse: In this context, a failure in which the model does not recognize the perturbation and outputs the answer or reasoning steps of the original, unmodified problem
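
To make the edit distance entry above concrete, here is a minimal Python sketch. It assumes a character-level Levenshtein distance; the glossary does not say which variant is meant, so that choice is an assumption. The function name edit_distance and the two problem strings are hypothetical illustrations and are not taken from the benchmark.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (an assumed variant).

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`.
    """
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


# Hypothetical toy strings: only one digit changes between them.
original = "What is 3 + 4?"
perturbed = "What is 3 + 5?"
print(edit_distance(original, perturbed))
```

A small distance between an original and a perturbed problem says only that the two texts look alike. It does not indicate whether the perturbation is simple or hard, since changing a few characters can be enough to invalidate the original solution method.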