Large Language Models for Mathematical Reasoning: Progresses and Challenges

📝 Paper Summary

Mathematical Reasoning Survey of LLMs

This survey comprehensively categorizes the landscape of LLM-based mathematical reasoning, covering problem types, datasets, prompting/fine-tuning techniques, and persisting challenges to unify disparate research efforts.

Core Problem

The rapid growth of LLM-based math reasoning research has led to a vast, varied landscape with disparate datasets and metrics, making it difficult to discern true progress or shared obstacles.

Why it matters:

Current research is siloed by problem type (e.g., arithmetic vs. geometry), which obscures obstacles shared across the broader field
Without a unified framework, it is hard to assess whether LLMs are achieving generalized mathematical capabilities or merely overfitting to specific tasks

Concrete Example: Accuracy on simple arithmetic (e.g., '21 + 97') says little about a model's ability to solve geometry problems (e.g., 'What is its area?') or to meet the rigorous logic required for Automated Theorem Proving, so performance claims across the literature are inconsistent.

Key Novelty

Four-Dimensional Survey Framework

Organizes the field along four dimensions: (1) problem types and datasets, (2) LLM-oriented techniques (prompting vs. fine-tuning), (3) influencing factors, and (4) persisting challenges
Distinguishes itself from prior surveys by specifically focusing on LLMs (rather than general Deep Learning) and incorporating educational perspectives

⚙️ Technical Details

Problem Definition

Setting: Automated resolution of mathematical problems across diverse modalities and difficulty levels

Inputs: Mathematical queries in various formats: arithmetic expressions, Math Word Problems (MWP), geometry problems (visual or symbolic), theorem-proving conjectures, or tabular contexts

Outputs: Final answers, step-by-step rationales, proofs, or generated variations of problems

Reproducibility

Not provided, as this is a survey paper. The text includes citations for all discussed datasets (Math-140, TabMWP, ChartQA, etc.).

📊 Experiments & Results

Main Takeaways

Classifies mathematical reasoning into five distinct domains: Arithmetic, Math Word Problems (MWP), Geometry, Automated Theorem Proving (ATP), and Math in vision-language contexts, each requiring different datasets (e.g., Math-140 vs. MiniF2F)
Identifies a progression of methodologies: from Prompting frozen LLMs (e.g., Chain-of-Thought), to Enhancing them with external tools (e.g., Python REPL, Program-of-Thought), to Fine-tuning specific models (e.g., Minerva, MathInstruct)
Highlights that evaluation must go beyond accuracy to include confidence provision and verifiable explanations, as LLMs can be unstable or 'hallucinate' reasoning steps
Notes specific challenges in Geometry and Tabular Math (TabMWP), where models must interpret spatial or structured data, unlike pure text-based Arithmetic or MWP
Observes that collaboration between AI and human expertise is increasingly advocated for complex tasks like theorem proving, rather than purely autonomous approaches

📚 Prerequisite Knowledge

Prerequisites

Large Language Models
Mathematical Datasets
Prompt Engineering strategies

Key Terms

MWP: Math Word Problems—mathematical questions presented as natural language narratives rather than pure equations

ATP: Automated Theorem Proving—autonomous construction of logical proofs for mathematical conjectures
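
To make the ATP definition concrete, the block below shows what a conjecture and a machine-checkable proof can look like in Lean 4. It is only an illustrative sketch: the theorem name `add_comm_example` is made up here and is not taken from the survey or from any benchmark it discusses. An ATP system would be expected to produce the proof term on the last line on its own.

```lean
-- Conjecture: addition of natural numbers is commutative.
-- The statement is the theorem's type; the proof is the term after `:=`.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b  -- reuse the core library lemma as the proof
```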

Chain-of-Thought: A prompting technique that encourages the model to generate intermediate reasoning steps before the final answer
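
A minimal sketch of how Chain-of-Thought prompting might be wired up, assuming a zero-shot setup with the commonly used cue "Let's think step by step." The helper names `build_cot_prompt` and `extract_final_answer` and the hard-coded `sample_response` are hypothetical; no real model is called, and taking the last number in the response is a simplifying assumption rather than a method described in the survey.

```python
import re

# One common zero-shot cue; other phrasings or few-shot exemplars also work.
COT_CUE = "Let's think step by step."


def build_cot_prompt(question: str) -> str:
    """Append a reasoning cue so the model writes intermediate steps first."""
    return f"Q: {question}\nA: {COT_CUE}"


def extract_final_answer(response: str):
    """Return the last number in the response as a float, or None if absent."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(numbers[-1]) if numbers else None


# Hypothetical model output, hard-coded for illustration only.
sample_response = "21 + 97 = 21 + 100 - 3 = 121 - 3 = 118. The answer is 118."

if __name__ == "__main__":
    print(build_cot_prompt("What is 21 + 97?"))
    print(extract_final_answer(sample_response))
```

Keeping the intermediate steps in the response is what makes the rationale inspectable; the answer extraction is a separate, much simpler step.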

Program-of-Thought: A method where the LLM generates executable code (like Python) to solve the reasoning steps of a math problem, separating reasoning from computation
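
The separation of reasoning from computation can be sketched as follows. The program text stands in for what an LLM might generate for a rectangle-area question; it is hard-coded here, and the helper `run_generated_program` and the convention of storing the result in a variable named `answer` are assumptions made for illustration. Emptying `__builtins__` only limits what the snippet can call; it is not a security sandbox.

```python
def run_generated_program(program: str, answer_var: str = "answer"):
    """Execute model-written code and return the value bound to answer_var."""
    namespace = {}
    # Empty builtins restrict the snippet to plain arithmetic and assignment.
    # This is NOT a real sandbox; untrusted code needs proper isolation.
    exec(program, {"__builtins__": {}}, namespace)
    return namespace[answer_var]


# Hypothetical LLM output for: "A rectangle is 7 units long and 4 units wide.
# What is its area?" The model supplies the reasoning as code; the Python
# interpreter performs the arithmetic.
generated_program = """
length = 7
width = 4
answer = length * width
"""

if __name__ == "__main__":
    print(run_generated_program(generated_program))
```

The design point is that the model never has to compute `7 * 4` itself; it only has to express the steps correctly.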

TabMWP: A dataset for math word problems requiring reasoning over tabular data contexts (tables, images, or structured text)

Geometry: Problems requiring spatial understanding of shapes, sizes, and interrelationships, often involving visual or symbolic inputs

Fine-tuning: Adjusting the parameters of a pre-trained model on a specific dataset (e.g., math problems) to improve performance

Python REPL: An interactive shell (Read-Eval-Print Loop) that allows the LLM to execute code snippets to verify calculations
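
As a rough illustration of using code execution to verify a calculation, the sketch below checks a claimed answer against a computed one. It does not launch an actual REPL: a small AST-based evaluator stands in for the execution step, and the names `safe_eval` and `verify` are hypothetical helpers, not part of any tool named in the survey.

```python
import ast
import operator

# Only basic arithmetic operators are allowed in checked expressions.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str):
    """Evaluate an arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expr, mode="eval").body)


def verify(expr: str, claimed) -> bool:
    """Return True if the model's claimed answer matches the computed value."""
    return safe_eval(expr) == claimed


if __name__ == "__main__":
    print(verify("21 + 97", 118))  # claimed answer agrees with the computation
    print(verify("21 + 97", 108))  # claimed answer is wrong
```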