
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh, Keivan Alizadeh-Vahid, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar
Apple
International Conference on Learning Representations (2025)
Reasoning, Benchmark, Factuality

📝 Paper Summary

Mathematical Reasoning Evaluation, Robustness Analysis, Benchmark Contamination
Current LLMs rely on fragile pattern matching rather than formal logical reasoning, as demonstrated by sharp performance drops when irrelevant context is added to math problems or their numerical values are changed.
Core Problem
The widely used GSM8K benchmark is static, allowing for data contamination and failing to capture the fragility of LLM reasoning under minor variations or irrelevant context.
Why it matters:
  • Reported metrics on GSM8K may be unreliable due to overfitting or contamination, creating a false sense of progress in mathematical reasoning
  • Models that cannot handle irrelevant information (No-Op) or variable changes are unreliable for real-world applications requiring genuine logic
  • Current single-point accuracy metrics mask high variance across different instantiations of the same logical problem
Concrete Example: In the GSM-NoOp dataset, a question asks about the number of kiwis a person has. An irrelevant clause is added: '5 of them were smaller than average.' The model blindly subtracts the 5 smaller kiwis from the total, even though size is irrelevant to the count, because it mimics subtraction patterns seen in training.
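The failure mode can be made concrete with a few lines of arithmetic (the numbers below are illustrative, chosen to mirror the kiwi example; they are not taken from the summary itself):

```python
# Minimal illustration of the No-Op failure mode. The "smaller than
# average" clause changes nothing about the count, so the correct
# answer ignores it; a pattern-matching model subtracts it anyway.
picked_friday, picked_saturday = 44, 58
picked_sunday = 2 * picked_friday   # e.g. "double the number from Friday"
smaller_than_average = 5            # irrelevant: size does not affect count

correct_answer = picked_friday + picked_saturday + picked_sunday  # 190
pattern_matched_answer = correct_answer - smaller_than_average    # 185
```

The gap between the two final values is exactly what GSM-NoOp measures at scale.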
Key Novelty
GSM-Symbolic and GSM-NoOp Benchmarks
  • Creates symbolic templates from GSM8K questions to generate diverse instantiations with different values and names, enabling distribution-based evaluation rather than single-point metrics
  • Introduces GSM-NoOp, which inserts seemingly relevant but logically inconsequential clauses (e.g., about fruit size or color) to test if models can discern necessary information
  • Demonstrates that reasoning capabilities degrade as the number of clauses increases, supporting the hypothesis that models perform pattern matching rather than multi-step logic
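The template idea above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code; the template text, slot names, and value ranges are hypothetical:

```python
# Sketch of GSM-Symbolic-style template instantiation: a GSM8K-like
# question becomes a template with symbolic slots for names and numbers,
# and sampling the slots yields many instances of the same logical problem.
import random

TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "How many apples does {name} have in total?")

def instantiate(seed: int):
    """Return one (question, ground_truth_answer) instantiation."""
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Liam", "Sofia", "Noah"])
    x, y = rng.randint(5, 50), rng.randint(5, 50)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y  # ground truth follows from the template's logic
    return question, answer
```

Because the ground truth is computed from the sampled slots, every instantiation is automatically labeled, which is what makes distribution-based evaluation cheap.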
Evaluation Highlights
  • Over 65% performance drop on GSM-NoOp for the Phi-3-mini model when irrelevant clauses are added
  • Performance variance of ~15% for Phi-3.5-mini across different numerical instantiations of the exact same reasoning problem
  • Adding a single clause (GSM-P1) causes significant performance drops across all 25 state-of-the-art models tested
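The variance figures above come from scoring many instantiations rather than one. A minimal sketch of such distribution-based evaluation, where `model_is_correct` is a hypothetical stand-in for running a model and checking its final answer:

```python
# Distribution-based evaluation sketch: instead of a single accuracy
# number, score the model on several evaluation sets, each a different
# draw of names/values for the same logical problems, and report the
# mean and spread of the per-set accuracies.
from statistics import mean, stdev

def accuracy_distribution(model_is_correct, instances, n_sets=10, set_size=50):
    accs = []
    for s in range(n_sets):
        batch = instances[s * set_size:(s + 1) * set_size]
        accs.append(mean(1.0 if model_is_correct(q) else 0.0 for q in batch))
    return mean(accs), stdev(accs)
```

A large spread across sets, as reported for Phi-3.5-mini, indicates that single-point benchmark scores understate the model's sensitivity to superficial changes.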
Breakthrough Assessment
9/10
A critical reality check for the field. By exposing the extreme fragility of 'reasoning' models to irrelevant context and simple value changes, it fundamentally challenges the validity of current math benchmarks.