Synthetic Data Enhances Mathematical Reasoning of Language Models Based on Artificial Intelligence

Z Han, W Jiang
Georgetown University, Beijing University of Posts and Telecommunications
Information Technology and Control, 2025
Reasoning QA Benchmark

📝 Paper Summary

Mathematical Reasoning · Synthetic Data Generation · Small Language Models (SLMs)
This paper demonstrates that fine-tuning small language models on high-quality, AI-generated synthetic data significantly improves their mathematical reasoning capabilities in linear and abstract algebra at minimal cost.
Core Problem
Training Large Language Models (LLMs) requires massive datasets and computational resources, making it expensive for individuals to develop specialized mathematical models.
Why it matters:
  • High costs of GPU clusters and data collection limit access for individual researchers and smaller organizations
  • General-purpose LLMs often struggle with specific mathematical domains or produce hallucinations in reasoning
  • Existing datasets for specific fields like linear algebra are often limited in size or lack step-by-step reasoning
Concrete Example: A general-purpose LLM such as GPT-4o might fail to correctly compare numbers (e.g., 9.11 vs 9.9) and lacks deep linear algebra reasoning. Standard datasets such as Linear Algebra QA contain only ~200 examples, too few for effective fine-tuning.
Key Novelty
Cost-Effective Synthetic Data Fine-Tuning for Specialized Math
  • Leverages a commercial synthetic data platform (Gretel.ai) to generate thousands of high-quality mathematical QA pairs (definitions, theorems, calculations) from prompt templates without manual collection
  • Integrates Chain-of-Thought-style reasoning directly into the synthetic data generation process, teaching models not just the final answer but the derivation steps
  • Demonstrates that small, open-source models (SLMs) like Mistral-7B can achieve significant performance gains on specific math tasks using this synthetic data
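The template-based generation idea can be sketched in a few lines. The snippet below is a minimal local illustration, not the paper's Gretel.ai pipeline: it expands a prompt template into QA pairs whose step-by-step derivation is generated alongside the answer, then writes them in a JSONL instruction/response shape typical for fine-tuning tools. All function names here (`make_determinant_example`, `build_dataset`) are hypothetical.

```python
# Sketch of template-based synthetic QA generation with step-by-step
# ("Chain-of-Thought") solutions baked into each record. Illustrative
# only; the paper uses the Gretel.ai platform rather than this code.
import json
import random

def make_determinant_example(rng: random.Random) -> dict:
    """Create one linear-algebra QA pair with reasoning steps."""
    a, b, c, d = (rng.randint(-9, 9) for _ in range(4))
    det = a * d - b * c
    question = (
        f"Compute the determinant of the 2x2 matrix "
        f"[[{a}, {b}], [{c}, {d}]]."
    )
    # The derivation is emitted with the answer, so a fine-tuned model
    # learns the procedure, not just the final value.
    reasoning = (
        f"For a 2x2 matrix [[a, b], [c, d]], det = a*d - b*c. "
        f"Here a*d = {a}*{d} = {a * d} and b*c = {b}*{c} = {b * c}, "
        f"so det = {a * d} - ({b * c}) = {det}."
    )
    return {"question": question, "reasoning": reasoning, "answer": str(det)}

def build_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate n examples reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_determinant_example(rng) for _ in range(n)]

if __name__ == "__main__":
    # One JSON object per line: the format most fine-tuning toolchains accept.
    with open("synthetic_linear_algebra.jsonl", "w") as f:
        for ex in build_dataset(1000):
            f.write(json.dumps(ex) + "\n")
```

Scaling this pattern across definition, theorem, and calculation templates yields the thousands of examples the paper attributes to its synthetic pipeline, at essentially zero data-collection cost.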
Evaluation Highlights
  • +18.2% accuracy increase for GPT-3 on the Abstract Algebra benchmark after fine-tuning
  • ~24.0% accuracy increase for GPT-3 on Linear Algebra calculation benchmarks
  • Mistral-7B-v0.1 achieved ~2x accuracy improvement on Linear Algebra calculations after fine-tuning, outperforming larger models like Llama-2-13B
Breakthrough Assessment
6/10
Provides a practical, low-cost recipe for democratizing specialized model training. While the method relies on existing tools (Gretel, OpenAI, AutoTrain), the empirical validation on specific algebra tasks is valuable for practitioners.