Evol-Instruct: A method that uses LLMs to automatically generate complex, diverse instructions by iteratively rewriting existing ones (e.g., adding constraints or deepening the required reasoning)
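The iterative-rewriting loop can be sketched as below. The evolution prompt templates and the `stub_llm` function are illustrative placeholders, not the paper's actual prompts; a real pipeline would call an actual model.

```python
import random

# Hypothetical evolution prompt templates (illustrative wording, not the paper's).
EVOLUTION_OPS = {
    "add_constraint": "Rewrite the instruction, adding one new constraint:\n{instruction}",
    "deepen": "Rewrite the instruction so it needs deeper multi-step reasoning:\n{instruction}",
    "concretize": "Rewrite the instruction, replacing abstract terms with concrete ones:\n{instruction}",
}

def evolve(instruction: str, llm, rounds: int = 3, seed: int = 0) -> list[str]:
    """Iteratively rewrite an instruction, keeping every generation."""
    rng = random.Random(seed)
    generations = [instruction]
    for _ in range(rounds):
        op = rng.choice(sorted(EVOLUTION_OPS))      # pick one evolution operation
        prompt = EVOLUTION_OPS[op].format(instruction=generations[-1])
        generations.append(llm(prompt))             # the rewrite becomes the next seed
    return generations

# Stub LLM so the sketch runs; it just tags the instruction line.
def stub_llm(prompt: str) -> str:
    return prompt.splitlines()[-1] + " (evolved)"

history = evolve("Solve 3x + 5 = 20 for x.", stub_llm, rounds=2)
```

Each round feeds the previous generation back in, so complexity compounds across rounds rather than being added all at once.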
RLEIF: Reinforcement Learning from Evol-Instruct Feedback—the paper's proposed method, which combines Evol-Instruct data generation with RL optimization guided by two reward models (IRM and PRM)
PRM: Process-supervised Reward Model—a reward model that scores the correctness of each individual step in a reasoning chain, rather than just the final answer
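The contrast between step-level and answer-only reward can be sketched as follows. The per-step scores are hand-set to mimic a trained PRM's output, and min-aggregation is one common choice, not necessarily the paper's.

```python
# Illustrative step-level scoring with a PRM vs. an answer-only reward.

def prm_reward(step_scores: list[float]) -> float:
    """Aggregate per-step correctness scores into one scalar reward.
    Taking the minimum penalizes a chain for its weakest step;
    product or mean aggregation are also common choices."""
    return min(step_scores)

def answer_only_reward(predicted: str, gold: str) -> float:
    """Reward that only checks the final answer."""
    return 1.0 if predicted == gold else 0.0

# A chain whose second step a PRM scores as likely wrong (0.10),
# even though the final answer happens to match the gold answer.
step_scores = [0.95, 0.10, 0.90]
print(answer_only_reward("42", "42"))  # 1.0 — the flawed step is invisible
print(prm_reward(step_scores))         # 0.1 — the flawed step is penalized
```

This is exactly the failure mode the False-Positive entry below describes: answer-only checking cannot distinguish a sound derivation from a lucky one.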
IRM: Instruction Reward Model—a reward model trained to predict the quality (difficulty and clarity) of the mathematical instructions themselves
PPO: Proximal Policy Optimization—a reinforcement learning algorithm used to update the language model policy
GSM8k: A benchmark dataset of 8.5K high-quality, linguistically diverse grade school math word problems requiring multi-step reasoning
MATH: A benchmark dataset of 12,500 challenging competition mathematics problems (algebra, geometry, number theory, precalculus, etc.)
SFT: Supervised Fine-Tuning—training the model on instruction-response pairs with standard next-token prediction, typically as the stage preceding RL
False-Positive: A reasoning outcome in which the final answer is correct but one or more intermediate steps contain errors—a flaw that answer-only rewards cannot detect
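Tying the glossary together, a minimal sketch of how the two reward models might be combined into the scalar PPO reward. Multiplying the IRM and PRM scores is an assumption here; the paper may combine them differently (e.g., weighted sum).

```python
# Hypothetical combined RLEIF reward: instruction quality (IRM) gates
# the process-level answer score (PRM). Aggregation choices are illustrative.

def combined_reward(irm_score: float, prm_step_scores: list[float]) -> float:
    """Scalar reward for PPO from both reward models.

    irm_score       -- IRM's quality score for the instruction, in [0, 1]
    prm_step_scores -- PRM's per-step correctness scores, each in [0, 1]
    """
    prm_score = min(prm_step_scores)   # weakest-step aggregation (illustrative)
    return irm_score * prm_score       # product combination (an assumption)

# A well-posed instruction (0.8) with one weak reasoning step (0.5):
reward = combined_reward(0.8, [0.9, 0.5, 0.95])
```

Gating by the IRM score means a chain only earns high reward when the evolved instruction itself is judged well-formed, not just when the reasoning passes the PRM.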