AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

📝 Paper Summary

Mathematical Reasoning Post-training (Supervised Fine-Tuning) Reward Modeling

AceMath improves math reasoning by initializing models with a general-domain SFT stage before math-specific fine-tuning on heavily filtered synthetic data, coupled with a robust reward model.

Core Problem

Existing math-specialized models often lack general reasoning capabilities needed as a foundation for advanced math, and their training data frequently contains incorrectly generated or unsolvable synthetic prompts.

Why it matters:

Math is a critical testbed for evaluating complex logical reasoning in LLMs, serving as a verifiable proxy for intelligence
Current approaches that fine-tune directly on math data may miss the benefits of strong general instruction-following capabilities
Low-quality synthetic data (unsolvable problems, style biases) hampers the reliability of both generation and reward models

Concrete Example: Synthetic prompt evolution often introduces constraints that make a problem unsolvable (e.g., adding contradictory conditions to a geometry problem). Training on such data, where the model forces an answer to an impossible question, degrades reasoning logic. AceMath explicitly filters these unsolvable synthetic prompts.

Key Novelty

Two-Stage SFT Strategy & Robust Reward Modeling

Implements a 'General SFT' stage first to build instruction-following and coding foundations, creating a better initialization for subsequent 'Math SFT'
Constructs math SFT data using a rigorous cross-check pipeline where solutions are verified by agreement between two different models (GPT-4o-mini and Qwen2.5-Math)
Introduces AceMath-RewardBench to evaluate reward models on diverse response styles and difficulty levels

Architecture

The data curation and training workflow for AceMath, detailing the sources and filtering of General and Math SFT data.

Evaluation Highlights

AceMath-72B-Instruct achieves 71.8 average score across math benchmarks, outperforming Qwen2.5-Math-72B-Instruct (68.2) by 3.6 points
AceMath-7B-Instruct (67.2) surpasses the baseline Qwen2.5-Math-7B-Instruct (62.9) by 4.3 points and nearly matches the 10x larger Qwen2.5-Math-72B
Combining AceMath-72B-Instruct with AceMath-72B-RM achieves the highest average rm@8 score across seven math reasoning benchmarks

Breakthrough Assessment

8/10

Significant performance gains over state-of-the-art open-weights models (Qwen2.5-Math) using purely data-centric and post-training innovations. The release of a specialized reward benchmark is also a valuable contribution.

⚙️ Technical Details

Problem Definition

Setting: Mathematical reasoning and problem solving via text generation and verification

Inputs: Natural language math problem P

Outputs: Step-by-step reasoning solution S and final answer A

Pipeline Flow

Data Curation (Prompt Collection & Synthetic Generation)
Response Generation & Filtering (Cross-checking)
Stage 1: General SFT Training
Stage 2: Math SFT Training
Inference: Generation + Reward Model Verification

System Modules

Data Curator (Data Preparation)

Aggregates prompts from General, Coding, and Math domains; generates synthetic math prompts via evolution

Model or implementation: GPT-4o-mini (for prompt evolution)

Response Generator (Teacher) (Data Preparation)

Generates synthetic solutions for the collected prompts to create SFT pairs

Model or implementation: Qwen2.5-Math-72B-Instruct and GPT-4o-mini

AceMath-Instruct

Generates solutions to math problems

Model or implementation: Fine-tuned versions of Qwen2.5-Math (1.5B/7B/72B), Llama3.1, or DeepSeek

AceMath-RM

Scores generated solutions to identify the correct one

Model or implementation: AceMath-72B-RM (Reward Model)

Novel Architectural Elements

Two-stage SFT pipeline: Explicitly training on General+Code domains (Stage 1) before Math (Stage 2) to prime reasoning capabilities
Cross-model verification protocol for synthetic data: Validating Qwen2.5-Math solutions against GPT-4o-mini consensus before adding to training set

Modeling

Base Model: Qwen2.5-Math-72B (and 1.5B/7B variants), Llama-3.1-8B, DeepSeek-Coder-7B

Training Method: Supervised Fine-Tuning (SFT) in two stages

Objective Functions:

Purpose: Minimize the negative log-likelihood of the target response tokens.

Formally: Standard Cross-Entropy Loss on response tokens only.

Training Data:

Stage 1 (General SFT): ~2M samples (1.2M Code, 0.7M General, 1.2M Math subset)
Stage 2 (Math SFT): 1.6M samples (NuminaMath + Synthetic subset + 800K cross-checked high-quality samples)

Key Hyperparameters:

learning_rate_general_sft: 5e-6
learning_rate_math_sft: 3e-6
global_batch_size: 128 (256 for 72B model)
+ 3 more
optimizer: AdamW
max_sequence_length: 4096
epochs: 1

Compute: Not reported in the paper

Comparison to Prior Work

vs. Qwen2.5-Math-Instruct: AceMath adds a 'General SFT' initialization stage and uses stricter synthetic data filtering (cross-checking), resulting in higher performance on the same base model
vs. OpenMathInstruct: AceMath uses a more complex 2-stage pipeline and incorporates coding data to boost reasoning [not cited in paper as direct methodology comparison, but contextual]
vs. GPT-4o: AceMath-7B (a small open model) achieves comparable performance to GPT-4o on several math benchmarks

Limitations

Dependency on strong teacher models (GPT-4o-mini, Qwen2.5-Math-72B) for synthetic data generation and verification
Small models (1.5B) still benefit significantly from math-continual pre-training, whereas larger models (72B) show diminishing returns from the 'Math Base' initialization
Evaluation on highly challenging competition benchmarks (AMC, AIME) is limited to small sample sizes (30-40 samples)
Paper snippet provided ends before detailing the Reward Model training specifics (Section 4)

Reproducibility

Code: https://research.nvidia.com/labs/adlr/acemath

Available: Model weights, training data, and AceMath-RewardBench are released at https://research.nvidia.com/labs/adlr/acemath. Missing: Specific compute hours/hardware used for training (only batch sizes/LR provided). The paper snippet provided cuts off before Section 4, so RM training specifics are not fully detailed here.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on most benchmarks; 5-shot for MMLU/MMLU-STEM.

Benchmarks:

GSM8K (Grade school math)
MATH (Challenging competition math)
AceMath-RewardBench (Reward model evaluation) [New]
Minerva Math (General math reasoning)
Olympiad Bench (Olympiad-level math)
HumanEval (Python coding)

Metrics:

Pass@1 (Accuracy)
rm@8 (Best-of-N accuracy using Reward Model)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison shows AceMath-Instruct models consistently outperforming their direct Qwen2.5-Math-Instruct baselines across varying parameter scales.
Average (Math Benchmarks)	Score	68.2	71.8	+3.6
Average (Math Benchmarks)	Score	62.9	67.2	+4.3
Average (Math Benchmarks)	Score	51.3	57.7	+6.4
Ablation studies validate the two-stage training strategy (General -> Math) and data filtering choices.
Average (Math & Code)	Score	46.3	49.7	+3.4
Average (Math Benchmarks)	Score	70.9	71.8	+0.9

Experiment Figures

Performance comparison (Pass@1) of AceMath-Instruct models against Qwen2.5-Math, GPT-4o, and Claude-3.5 Sonnet across multiple benchmarks.

Analysis of Backbone Model impact: Comparing AceMath trained on 'Base' models vs. 'Math-Base' (continually pre-trained) models.

Main Takeaways

Initializing with General SFT (Code+General Knowledge) before Math SFT consistently improves downstream math reasoning, likely by establishing better instruction-following and logic foundations.
Data Quality > Quantity: Using a smaller set of cross-checked synthetic samples (verified by agreement between GPT-4o-mini and Qwen) outperforms using all generated samples.
For large models (72B), the gap between using a general base model and a math-continually-pretrained base model narrows, suggesting scale inherently covers some domain adaptation needs.
Small models (1.5B-7B) benefit most dramatically from the AceMath pipeline, achieving performance competitive with much larger baselines.

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT) pipelines
Synthetic data generation for LLMs
Reward Modeling / Verifiers
Chain-of-Thought (CoT) reasoning

Key Terms

SFT: Supervised Fine-Tuning—training a pre-trained model on labeled examples (prompts and responses) to follow instructions

General SFT: The first training stage in this paper, focusing on diverse topics (coding, general knowledge) to build a reasoning foundation

Math SFT: The second training stage, focusing exclusively on mathematical problem-solving data

rm@8: A metric evaluating the performance of 'Best-of-N' sampling; specifically, generating 8 solutions and selecting the best one using a reward model

pass@1: The accuracy of the model when it generates a single response to a problem

In-breadth evolution: A synthetic data generation technique that creates new prompts by varying the topic or setting of a seed prompt while maintaining similar difficulty

Data decontamination: Removing training samples that overlap significantly with the test set to prevent the model from memorizing answers

Outcome reward model: A model trained to predict whether a final answer is correct, often used to rerank generated solutions

Greedy decoding: A generation strategy where the model always picks the most likely next token, resulting in a deterministic output