MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations

📝 Paper Summary

Mathematical Reasoning Robustness Generalization

MATH-Perturb introduces a benchmark of 'hard' problem variations—where original solution methods no longer apply—revealing that top LLMs rely heavily on memorized reasoning patterns rather than genuine understanding.

Core Problem

LLMs achieve high scores on math benchmarks, but it is unclear whether they truly reason or merely memorize solution templates. Existing robustness tests cannot settle the question, because they use only simple numerical perturbations that preserve the original reasoning path.

Why it matters:

High benchmark scores may be inflated by data contamination or by over-representation of specific problem types in training data
Models that rely on 'bag-of-heuristics' reasoning fail when problem conditions change fundamentally, making them unreliable for real-world applications where contexts shift
Current evaluations underestimate the fragility of reasoning models because they do not test 'hard perturbations' that invalidate the memorized solution steps

Concrete Example: A model might solve a geometry problem correctly using a symmetry argument. Under a hard perturbation that breaks the symmetry, the model may still apply the original symmetry-based solution and produce a wrong answer (as seen in Figure 5).

Key Novelty

Distinguishing 'Hard' vs. 'Simple' Reasoning Perturbations

Introduces MATH-P-Hard, a dataset where problems are textually similar to the original but fundamentally altered so that original solution paths are invalid (e.g., breaking a symmetry assumption)
Contrasts this with MATH-P-Simple, where only non-essential values change, allowing the assessment of whether a model understands the *why* or just memorizes the *how*
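The distinction can be illustrated with a toy problem that is not taken from the paper: "find the sum of all real solutions of x^2 + bx + c = 0". A memorized shortcut (Vieta's formula, sum = -b) survives a simple perturbation that only changes the coefficients, but fails under a hard perturbation where the quadratic has no real roots. The function names and the three example quadratics below are hypothetical, chosen only for illustration.

```python
import math


def sum_real_roots_template(b: float, c: float) -> float:
    """Memorized shortcut: Vieta's formula says the roots of x^2 + b*x + c sum to -b.

    This silently assumes that both roots are real and distinct.
    """
    return -b


def sum_real_roots_true(b: float, c: float) -> float:
    """Sum of the distinct real solutions of x^2 + b*x + c = 0."""
    disc = b * b - 4 * c
    if disc < 0:
        return 0.0  # no real solutions, so the sum is empty
    if disc == 0:
        return -b / 2  # a single repeated real solution
    root = math.sqrt(disc)
    return (-b + root) / 2 + (-b - root) / 2


# Hypothetical problem variants, written as (b, c) pairs.
original = (-6, 5)  # x^2 - 6x + 5 = 0, real roots 1 and 5
simple = (-8, 7)    # x^2 - 8x + 7 = 0, same reasoning path still valid
hard = (-2, 5)      # x^2 - 2x + 5 = 0, no real roots: the shortcut breaks

for name, (b, c) in [("original", original), ("simple", simple), ("hard", hard)]:
    template = sum_real_roots_template(b, c)
    truth = sum_real_roots_true(b, c)
    print(f"{name}: template={template}, true={truth}, match={template == truth}")
```

The hard variant is textually almost identical to the original, yet the shortcut returns 2 while the correct answer is 0, which is the failure pattern the benchmark is designed to surface.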

Architecture

Conceptual comparison between Simple Perturbation and Hard Perturbation applied to a math problem.

Evaluation Highlights

o1-mini's accuracy drops by 16.49% on MATH-P-Hard relative to the original problems, indicating reliance on specific reasoning patterns
gemini-2.0-flash-thinking loses 12.9% accuracy on MATH-P-Hard, showing that even strong reasoning models are vulnerable
In-context learning with the original problems as demonstrations hurts large models on MATH-P-Hard: 18%-40% of correct answers become wrong because the demonstrations are misleading

Breakthrough Assessment

8/10

Significant contribution to evaluation methodology. Exposes a critical weakness (solution template memorization) in SOTA reasoning models that standard benchmarks miss.

⚙️ Technical Details

Problem Definition

Setting: Mathematical Question Answering under Perturbation

Inputs: Mathematical problem text P (original or perturbed)

Outputs: Solution steps and final answer A

Pipeline Flow

Dataset Construction (Human Expert Annotation)
Model Inference (Zero-shot CoT)
Equivalence Checking (SymPy)

System Modules

Human Annotator

Create perturbed problems (Simple and Hard) from Level-5 MATH problems

Model or implementation: PhD students (Human Experts)

Reasoning Model

Generate solution and answer for the problem

Model or implementation: Various LLMs (e.g., o1-mini, GPT-4o, Claude-3.5-Sonnet)

Equivalence Checker

Verify if generated answer matches ground truth

Model or implementation: SymPy / Python Script
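A minimal sketch of what a SymPy-based equivalence check could look like is shown below. The summary only states that SymPy and a Python script are used; the function name, the string-comparison fallback, and the assumption that answers arrive as SymPy-parsable strings (rather than LaTeX) are illustrative choices, not the paper's actual script.

```python
import sympy


def answers_equivalent(predicted: str, reference: str) -> bool:
    """Return True if two answer strings denote the same mathematical value.

    Assumption: both answers are already in a form that sympy.sympify can
    parse (e.g. "2*sqrt(2)"); LaTeX answers would need a separate parser.
    """
    try:
        # Two expressions are equivalent if their difference simplifies to 0.
        difference = sympy.simplify(
            sympy.sympify(predicted) - sympy.sympify(reference)
        )
        return difference == 0
    except (sympy.SympifyError, TypeError, SyntaxError):
        # Fallback for answers SymPy cannot parse: exact match after trimming.
        return predicted.strip() == reference.strip()


print(answers_equivalent("sqrt(8)", "2*sqrt(2)"))
print(answers_equivalent("(x + 1)**2", "x**2 + 2*x + 1"))
print(answers_equivalent("3", "4"))
```

Symbolic comparison avoids penalizing a model for writing a correct answer in a different but equivalent form, which plain string matching would mark as wrong.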

Modeling

Base Model: Evaluated 18 LLMs including o1-mini, gemini-2.0-flash-thinking, GPT-4o, Claude-3.5-Sonnet

Compute: Not reported in the paper (Evaluation only, no training)

Comparison to Prior Work

vs. Functional-MATH: Functional-MATH only tests simple perturbations (numerical changes); MATH-Perturb tests hard perturbations where the solution logic must change
vs. GSM8K/SVAMP: Focuses on Level-5 (hardest) high school competition problems rather than grade school math
vs. Standard MATH Benchmark: Isolates 'reasoning' from 'memorization' by invalidating the specific solution paths found in the training data

Limitations

Benchmark size is relatively small (279 problem pairs) due to the high cost of expert annotation
Focuses only on Level-5 problems, potentially overlooking behaviors on simpler tasks
Evaluated primarily on accuracy; deeper automated analysis of *why* the reasoning failed (beyond manual inspection) is limited

Reproducibility

The data curation process is described in detail (12 experts, cross-validation). The MATH-P-Simple and MATH-P-Hard datasets are constructed from the public MATH dataset. No code URL is provided in the text. The explicit list of the 18 models tested is in Appendix A.

📊 Experiments & Results

Evaluation Setup

Zero-shot Chain-of-Thought (CoT) on math problems without tool usage

Benchmarks:

MATH (Original): Mathematical Reasoning, Level 5
MATH-P-Simple: Mathematical Reasoning, Simple Perturbations [New]
MATH-P-Hard: Mathematical Reasoning, Hard Perturbations [New]

Metrics:

Accuracy
Performance Drop (Delta between Original and Perturbed)
Statistical methodology: Not explicitly reported in the paper
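The two metrics above can be sketched as follows. The sketch assumes that the performance drop is the absolute difference between perturbed and original accuracy, both expressed in percent; the summary does not spell out the exact definition, and the per-problem flags are toy data rather than results from the paper.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Percentage of problems answered correctly."""
    if not correct:
        raise ValueError("need at least one graded problem")
    return 100.0 * sum(correct) / len(correct)


def performance_drop(original: Sequence[bool], perturbed: Sequence[bool]) -> float:
    """Perturbed accuracy minus original accuracy; negative means a drop."""
    return accuracy(perturbed) - accuracy(original)


# Toy per-problem correctness flags (illustrative only).
original_flags = [True, True, True, True]
hard_flags = [True, True, True, False]

print(accuracy(original_flags))                      # 100.0
print(accuracy(hard_flags))                          # 75.0
print(performance_drop(original_flags, hard_flags))  # -25.0
```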

Key Results

Performance drops on Hard Perturbations (MATH-P-Hard) indicate models struggle when original solution paths are invalidated.

Benchmark	Model	Metric	Baseline	This Paper	Δ
MATH-P-Hard	o1-mini	Accuracy change relative to original (%)	0.0	-16.49	-16.49
MATH-P-Hard	gemini-2.0-flash-thinking	Accuracy change relative to original (%)	0.0	-12.9	-12.9

In-Context Learning (ICL) experiments reveal a 'misleading effect' where original examples hurt performance on hard perturbations.

Benchmark	Model	Metric	Baseline	This Paper	Δ
MATH-P-Hard	Large models (upper end of the 18%-40% range)	Correct answers turned wrong under ICL (%)	0	40	+40
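One plausible way to compute the correct-to-wrong rate reported for the ICL experiments is sketched below. It assumes the rate is measured over problems the model solved without a demonstration, counting how many of those become wrong once the original problem is supplied as a demonstration. The function name and the toy flags are hypothetical.

```python
from typing import Sequence


def misled_rate(correct_without_demo: Sequence[bool],
                correct_with_demo: Sequence[bool]) -> float:
    """Percentage of initially correct answers that turn wrong under ICL."""
    if len(correct_without_demo) != len(correct_with_demo):
        raise ValueError("both runs must cover the same problems")
    # Keep only the ICL outcomes for problems solved without a demonstration.
    solved = [with_demo
              for without_demo, with_demo in zip(correct_without_demo, correct_with_demo)
              if without_demo]
    if not solved:
        raise ValueError("no problem was solved without a demonstration")
    flipped = sum(1 for still_correct in solved if not still_correct)
    return 100.0 * flipped / len(solved)


# Toy flags for five hard-perturbed problems (illustrative only).
without_demo = [True, True, True, True, False]
with_demo = [True, False, True, True, True]

print(misled_rate(without_demo, with_demo))  # 25.0
```

Conditioning on the initially correct problems isolates the misleading effect from any problems the demonstration happens to help with.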

Experiment Figures

Impact of In-Context Learning (ICL) on model performance for Simple vs. Hard perturbations.

Main Takeaways

All evaluated models, including o1-mini and gemini-2.0-flash-thinking, suffer significant performance drops (10-25%) on MATH-P-Hard, indicating a bias towards original reasoning patterns.
A new form of memorization is identified: models blindly apply learned problem-solving skills (templates) to modified contexts where they are no longer applicable.
In-Context Learning with original problems helps for Simple perturbations but frequently misleads models on Hard perturbations, as models fail to recognize subtle but critical differences.
Existing benchmarks using simple numerical perturbations overestimate model robustness; hard perturbations are necessary to test true understanding.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
In-Context Learning (ICL)
Data Contamination / Memorization in LLMs

Key Terms

MATH-P-Simple: A benchmark subset where problems are modified (e.g., changing numbers) but the underlying reasoning logic remains identical to the original problem

MATH-P-Hard: A benchmark subset where problems are modified such that the original solution path is strictly invalid, requiring new reasoning strategies

CoT: Chain-of-Thought—a prompting method where the model generates intermediate reasoning steps before the final answer

ICL: In-Context Learning—providing examples (demonstrations) in the prompt to guide the model's behavior for the current task
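The difference between the zero-shot CoT setup and the ICL variant can be sketched as a prompt builder. The instruction wording, the section labels, and the function name below are hypothetical; the paper's actual prompts are not reproduced in this summary.

```python
from typing import Optional, Tuple

# Hypothetical instruction text, not the paper's prompt.
COT_INSTRUCTION = "Solve the following problem step by step, then state the final answer."


def build_prompt(problem: str, demo: Optional[Tuple[str, str]] = None) -> str:
    """Build a zero-shot CoT prompt, or a one-shot ICL prompt if demo is given.

    demo is a (problem, solution) pair, e.g. the original unperturbed problem
    together with its reference solution.
    """
    parts = []
    if demo is not None:
        demo_problem, demo_solution = demo
        parts.append(
            f"Example problem:\n{demo_problem}\n\nExample solution:\n{demo_solution}"
        )
    parts.append(f"{COT_INSTRUCTION}\n\nProblem:\n{problem}")
    return "\n\n".join(parts)


zero_shot = build_prompt("Perturbed problem text")
one_shot = build_prompt(
    "Perturbed problem text",
    demo=("Original problem text", "Original solution text"),
)
print(one_shot)
```

In the ICL variant the demonstration sits directly above a nearly identical perturbed problem, which is the setting where a model can be tempted to copy the demonstrated solution path.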

Simple perturbation: Modifying non-critical parameters (like numerical values) that do not alter the fundamental reasoning pattern

Hard perturbation: Fundamental modifications to problem formulation that render the original solution method inapplicable

Edit distance: A metric measuring the textual similarity between the original and perturbed problem strings
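As an illustration, textual similarity between an original and a perturbed problem could be measured with the standard character-level Levenshtein distance. Whether the paper uses this exact variant, or a normalized or token-level one, is not stated in this summary, so both functions below are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical strings."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(edit_distance("kitten", "sitting"))  # 3
```

A small distance between an original problem and its hard perturbation is what makes the perturbation adversarial: the text looks familiar while the required reasoning differs.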

Mode collapse: In this context, when a model fails to recognize the perturbation and outputs the answer or reasoning steps of the original, unmodified problem