Synthetic Data Enhances Mathematical Reasoning of Language Models Based on Artificial Intelligence

Z Han, W Jiang
Georgetown University, Beijing University of Posts and Telecommunications
Information Technology and Control, 2025
Reasoning QA Benchmark

📝 Paper Summary

Mathematical Reasoning · Synthetic Data Generation · Small Language Models (SLMs)
This paper demonstrates that fine-tuning small language models on high-quality, AI-generated synthetic data significantly improves their mathematical reasoning capabilities in linear and abstract algebra at minimal cost.
Core Problem
Training Large Language Models (LLMs) requires massive datasets and computational resources, making it expensive for individuals to develop specialized mathematical models.
Why it matters:
  • High costs of GPU clusters and data collection limit access for individual researchers and smaller organizations
  • General-purpose LLMs often struggle with specific mathematical domains or produce hallucinations in reasoning
  • Existing datasets for specific fields like linear algebra are often limited in size or lack step-by-step reasoning
Concrete Example: A general-purpose LLM such as GPT-4o might fail to correctly compare numbers (e.g., 9.11 vs 9.9) and lacks deep linear algebra reasoning. Standard datasets such as Linear Algebra QA contain only ~200 examples, too few for effective fine-tuning.
Key Novelty
Cost-Effective Synthetic Data Fine-Tuning for Specialized Math
  • Leverages a commercial synthetic data platform (Gretel.ai) to generate thousands of high-quality mathematical QA pairs (definitions, theorems, calculations) from prompt templates without manual collection
  • Integrates Chain-of-Thought-style reasoning directly into the synthetic data generation process, teaching models not just the final answer but the derivation steps
  • Demonstrates that small, open-source models (SLMs) like Mistral-7B can achieve significant performance gains on specific math tasks using this synthetic data
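The template-based generation idea can be sketched in a few lines. The snippet below is a minimal local illustration, not the paper's Gretel.ai pipeline: it expands a prompt template into QA pairs whose step-by-step derivation is generated alongside the answer, then writes them in a JSONL instruction/response shape typical for fine-tuning tools. All function names here (`make_determinant_example`, `build_dataset`) are hypothetical.

```python
# Sketch of template-based synthetic QA generation with step-by-step
# ("Chain-of-Thought") solutions baked into each record. Illustrative
# only; the paper uses the Gretel.ai platform rather than this code.
import json
import random

def make_determinant_example(rng: random.Random) -> dict:
    """Create one linear-algebra QA pair with reasoning steps."""
    a, b, c, d = (rng.randint(-9, 9) for _ in range(4))
    det = a * d - b * c
    question = (
        f"Compute the determinant of the 2x2 matrix "
        f"[[{a}, {b}], [{c}, {d}]]."
    )
    # The derivation is emitted with the answer, so a fine-tuned model
    # learns the procedure, not just the final value.
    reasoning = (
        f"For a 2x2 matrix [[a, b], [c, d]], det = a*d - b*c. "
        f"Here a*d = {a}*{d} = {a * d} and b*c = {b}*{c} = {b * c}, "
        f"so det = {a * d} - ({b * c}) = {det}."
    )
    return {"question": question, "reasoning": reasoning, "answer": str(det)}

def build_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate n examples reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_determinant_example(rng) for _ in range(n)]

if __name__ == "__main__":
    # One JSON object per line: the format most fine-tuning toolchains accept.
    with open("synthetic_linear_algebra.jsonl", "w") as f:
        for ex in build_dataset(1000):
            f.write(json.dumps(ex) + "\n")
```

Scaling this pattern across definition, theorem, and calculation templates yields the thousands of examples the paper attributes to its synthetic pipeline, at essentially zero data-collection cost.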
Evaluation Highlights
  • +18.2% accuracy increase for GPT-3 on the Abstract Algebra benchmark after fine-tuning
  • ~24.0% accuracy increase for GPT-3 on Linear Algebra calculation benchmarks
  • Mistral-7B-v0.1 achieved ~2x accuracy improvement on Linear Algebra calculations after fine-tuning, outperforming larger models like Llama-2-13B
Breakthrough Assessment
6/10
Provides a practical, low-cost recipe for democratizing specialized model training. While the method relies on existing tools (Gretel, OpenAI, AutoTrain), the empirical validation on specific algebra tasks is valuable for practitioners.