
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh, Keivan Alizadeh-Vahid, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar
Apple
International Conference on Learning Representations (2025)
Reasoning, Benchmark, Factuality

📝 Paper Summary

Mathematical Reasoning Evaluation, Robustness Analysis, Benchmark Contamination
Current LLMs rely on fragile pattern matching rather than formal logical reasoning, as demonstrated by sharp performance drops when irrelevant context is added to math problems or their numerical values are changed.
Core Problem
The widely used GSM8K benchmark is static, allowing for data contamination and failing to capture the fragility of LLM reasoning under minor variations or irrelevant context.
Why it matters:
  • Reported metrics on GSM8K may be unreliable due to overfitting or contamination, creating a false sense of progress in mathematical reasoning
  • Models that cannot handle irrelevant information (No-Op) or variable changes are unreliable for real-world applications requiring genuine logic
  • Current single-point accuracy metrics mask high variance across different instantiations of the same logical problem
Concrete Example: In the GSM-NoOp dataset, a question asks about the number of kiwis a person has. An irrelevant clause is added: '5 of them were smaller than average.' The model blindly subtracts the 5 smaller kiwis from the total, even though size is irrelevant to the count, because it mimics subtraction patterns seen in training.
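The failure mode can be made concrete with a few lines of arithmetic (the numbers below are illustrative, chosen to mirror the kiwi example; they are not taken from the summary itself):

```python
# Minimal illustration of the No-Op failure mode. The "smaller than
# average" clause changes nothing about the count, so the correct
# answer ignores it; a pattern-matching model subtracts it anyway.
picked_friday, picked_saturday = 44, 58
picked_sunday = 2 * picked_friday   # e.g. "double the number from Friday"
smaller_than_average = 5            # irrelevant: size does not affect count

correct_answer = picked_friday + picked_saturday + picked_sunday  # 190
pattern_matched_answer = correct_answer - smaller_than_average    # 185
```

The gap between the two final values is exactly what GSM-NoOp measures at scale.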
Key Novelty
GSM-Symbolic and GSM-NoOp Benchmarks
  • Creates symbolic templates from GSM8K questions to generate diverse instantiations with different values and names, enabling distribution-based evaluation rather than single-point metrics
  • Introduces GSM-NoOp, which inserts seemingly relevant but logically inconsequential clauses (e.g., about fruit size or color) to test if models can discern necessary information
  • Demonstrates that reasoning capabilities degrade as the number of clauses increases, supporting the hypothesis that models perform pattern matching rather than multi-step logic
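The template idea above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code; the template text, slot names, and value ranges are hypothetical:

```python
# Sketch of GSM-Symbolic-style template instantiation: a GSM8K-like
# question becomes a template with symbolic slots for names and numbers,
# and sampling the slots yields many instances of the same logical problem.
import random

TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "How many apples does {name} have in total?")

def instantiate(seed: int):
    """Return one (question, ground_truth_answer) instantiation."""
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Liam", "Sofia", "Noah"])
    x, y = rng.randint(5, 50), rng.randint(5, 50)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y  # ground truth follows from the template's logic
    return question, answer
```

Because the ground truth is computed from the sampled slots, every instantiation is automatically labeled, which is what makes distribution-based evaluation cheap.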
Evaluation Highlights
  • Over 65% performance drop on GSM-NoOp for the Phi-3-mini model when irrelevant clauses are added
  • Performance variance of ~15% for Phi-3.5-mini across different numerical instantiations of the exact same reasoning problem
  • Adding a single clause (GSM-P1) causes significant performance drops across all 25 state-of-the-art models tested
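The variance figures above come from scoring many instantiations rather than one. A minimal sketch of such distribution-based evaluation, where `model_is_correct` is a hypothetical stand-in for running a model and checking its final answer:

```python
# Distribution-based evaluation sketch: instead of a single accuracy
# number, score the model on several evaluation sets, each a different
# draw of names/values for the same logical problems, and report the
# mean and spread of the per-set accuracies.
from statistics import mean, stdev

def accuracy_distribution(model_is_correct, instances, n_sets=10, set_size=50):
    accs = []
    for s in range(n_sets):
        batch = instances[s * set_size:(s + 1) * set_size]
        accs.append(mean(1.0 if model_is_correct(q) else 0.0 for q in batch))
    return mean(accs), stdev(accs)
```

A large spread across sets, as reported for Phi-3.5-mini, indicates that single-point benchmark scores understate the model's sensitivity to superficial changes.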
Breakthrough Assessment
9/10
A critical reality check for the field. By exposing the extreme fragility of 'reasoning' models to irrelevant context and simple value changes, it fundamentally challenges the validity of current math benchmarks.