AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset

📝 Paper Summary

Mathematical Reasoning, Synthetic Data Generation, Tool-Integrated Reasoning

OpenMath-Nemotron achieves state-of-the-art math reasoning by training on a massive synthetic dataset of tool-integrated solutions and employing a generative model to select the best answer from multiple candidates.

Core Problem

Strong open-weight reasoning models like DeepSeek-R1 struggle to integrate code execution (Tool-Integrated Reasoning) via prompting alone, and standard majority voting fails to bridge the gap between pass@1 and theoretical pass@k performance.

Why it matters:

Pure text reasoning often fails on complex calculations where code is reliable, but models trained only on text reasoning resist using tools even when instructed
Majority voting requires generating many expensive solutions and treats them independently, ignoring the model's ability to critically evaluate and compare reasoning traces
Competition settings such as AIMO impose strict time limits, so inference must be efficient rather than rely on massive brute-force sampling

Concrete Example: When prompted to use Python, models like DeepSeek-R1 often refuse or write code only to verify trivial arithmetic. In contrast, the proposed TIR model autonomously writes code to perform exhaustive searches or use numeric solvers for problems where analytical solutions are infeasible.

Key Novelty

OpenMath-Nemotron & GenSelect

Created OpenMathReasoning: a massive dataset of 540K problems with 3.2M Chain-of-Thought and 1.7M Tool-Integrated Reasoning solutions generated via iterative rejection sampling
Developed an iterative 'training-generation-filtering' loop to force instruction-following models to produce high-quality code-integrated reasoning, which is then used to train base reasoning models
Generative Solution Selection (GenSelect): Instead of a scalar score, a model is trained to read multiple candidate summaries and generate a reasoning trace concluding with the best solution

Architecture

The Data Construction Pipeline for GenSelect (Generative Solution Selection).

Evaluation Highlights

OpenMath-Nemotron-32B (TIR + Self GenSelect) achieves 93.3% accuracy on the Comp-Math-24-25 benchmark, significantly outperforming DeepSeek-R1 (79.1%)
Winning submission for AIMO-2 Kaggle competition, solving 34/50 private test problems (1st place)
OpenMath-Nemotron-7B with GenSelect (86.7%) outperforms the much larger DeepSeek-R1 (79.1%) on the Comp-Math-24-25 benchmark

Breakthrough Assessment

9/10

Establishes a new SOTA for open-weight math models, releases a massive high-quality dataset (OpenMathReasoning), and demonstrates a viable recipe for fusing code execution with reasoning models.

⚙️ Technical Details

Problem Definition

Setting: Mathematical problem solving with optional code execution capability

Inputs: Natural language math problem P

Outputs: Final answer A, potentially reached via interleaved natural language reasoning and Python code execution

Pipeline Flow

Problem Input
Solution Generation (CoT or TIR mode)
Solution Summarization (Optional for GenSelect)
Generative Selection (GenSelect) or Majority Voting
Final Answer Output
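
The flow above can be sketched as a small orchestration function. This is only an illustration under stated assumptions: `solve`, `generate`, `extract_answer`, `summarize`, and `select` are hypothetical names, the model calls are passed in as plain callables, and the selector is assumed to return the index of its preferred candidate. It is not the authors' implementation.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def solve(
    problem: str,
    generate: Callable[[str], str],
    extract_answer: Callable[[str], str],
    n_samples: int = 4,
    summarize: Optional[Callable[[str], str]] = None,
    select: Optional[Callable[[str, Sequence[str]], int]] = None,
) -> str:
    # Solution generation: sample several candidate traces (CoT or TIR).
    solutions = [generate(problem) for _ in range(n_samples)]
    answers = [extract_answer(s) for s in solutions]
    # Fallback: majority voting when no generative selector is supplied.
    if select is None:
        return majority_vote(answers)
    # Optional summarization: condense each trace before selection.
    summaries = [summarize(s) if summarize else s for s in solutions]
    # Generative selection: the selector returns the index it prefers.
    return answers[select(problem, summaries)]


# Toy usage with canned "model" outputs instead of a real LLM.
canned = [
    "... so the answer is 42",
    "... so the answer is 41",
    "... so the answer is 42",
]
outputs = iter(canned)
voted = solve(
    "toy problem",
    generate=lambda p: next(outputs),
    extract_answer=lambda s: s.rsplit(" ", 1)[1],
    n_samples=3,
)
print(voted)
```

Passing the model calls in as callables keeps the sketch independent of any particular inference stack. Switching from voting to GenSelect-style selection only changes the `select` argument.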

System Modules

Generator

Generate reasoning traces (either text-only CoT or text+code TIR) to solve the problem

Model or implementation: OpenMath-Nemotron (1.5B/7B/14B/32B)

Summarizer (Selection)

Create concise summaries of generated solution traces for the selector to evaluate

Model or implementation: Qwen2.5-32B-Instruct (during data creation) or OpenMath-Nemotron (inference)

Selector (GenSelect) (Selection)

Compare multiple solution summaries and reason about which is most likely correct

Model or implementation: OpenMath-Nemotron (same model as generator, different prompt)

Novel Architectural Elements

Unified model supporting CoT, TIR, and GenSelect tasks via prompt switching (multi-task SFT)
GenSelect mechanism treating verification as a generative reasoning task over solution summaries rather than scalar reward prediction

Modeling

Base Model: Qwen2.5-Base (1.5B, 7B, 14B, 32B). The 1.5B and 7B models are initialized from the Qwen2.5-Math variants.

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Maximize likelihood of correct reasoning traces (CoT, TIR) and correct selections (GenSelect).

Formally: Standard cross-entropy loss on target tokens.
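
For reference, the standard token-level cross-entropy objective can be written as below. The symbols are assumptions introduced here, not notation from the paper: x is the prompt, y_1, ..., y_T are the target tokens (reasoning trace or selection), and θ are the model parameters. Whether the loss is averaged per token or per sequence is not stated in the summary.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
```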

Adaptation: Full fine-tuning

Trainable Parameters: All parameters

Training Data:

OpenMathReasoning Dataset: 540K problems
3.2M CoT solutions (distilled from DeepSeek-R1/QwQ-32B)
1.7M TIR solutions (iterative generation/filtering from LIMO-Qwen-32B/QwQ-32B)
566K GenSelect examples (generated by QwQ-32B)

Key Hyperparameters:

learning_rate: 1e-4 (14B/32B), 2e-4 (7B), 3e-4 (1.5B)
batch_size: 1024
epochs: 6 (Round 1), 4 (Round 2 on hard subset)
weight_decay: 0.01
scheduler: Cosine with 10% linear warmup
rope_base: 500000
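
The scheduler entry above ("Cosine with 10% linear warmup") can be illustrated with a small function. This is a minimal sketch: the function name `lr_at_step`, the step-based warmup, and the `min_lr=0.0` floor are assumptions, since the summary gives only the schedule shape and the peak learning rates.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float,
    warmup_frac: float = 0.10,  # 10% linear warmup, as listed above
    min_lr: float = 0.0,        # assumed floor; not stated in the source
) -> float:
    """Learning rate at a 0-indexed optimizer step."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine


# Example: the 14B/32B peak rate of 1e-4 over a hypothetical 1000 steps.
schedule = [lr_at_step(s, 1000, 1e-4) for s in range(1000)]
print(max(schedule))
```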

Compute: Not reported in the paper; training with NeMo-Aligner suggests a large-scale cluster.

Comparison to Prior Work

vs. DeepSeek-R1: OpenMath-Nemotron integrates code execution (TIR) natively, whereas DeepSeek-R1 is text-focused and struggles when prompted to use tools.
vs. QwQ-32B: OpenMath-Nemotron adds GenSelect capability to self-verify answers, improving over majority voting.
vs. DeepSeek-R1-Distill-Qwen-32B: OpenMath-Nemotron-32B (TIR+GenSelect) achieves 93.3% vs 66.9% on Comp-Math-24-25, showing the value of TIR and GenSelect over pure distillation.

Limitations

GenSelect inference becomes unstable with >32 candidate solutions due to context length limits.
Smaller models (1.5B) are less consistent in using tools effectively, leading to higher rates of unfinished TIR solutions.
High variance in Kaggle leaderboard scores made it difficult to validate improvements during the competition.
TIR models produce longer generations than CoT models, impacting inference time constraints.

Reproducibility

Code: https://github.com/NVIDIA/NeMo-Skills

Highly reproducible. Code, models (OpenMath-Nemotron 1.5B-32B), and the full OpenMathReasoning dataset are released under a commercially permissive license. Kaggle submission code and TensorRT-LLM optimizations are described.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning on competition-level problems

Benchmarks:

Comp-Math-24-25: math competition problems from AIME 2024/25 and HMMT 2024/25 [New]
HLE-Math: text-only subset of the Math category of Humanity's Last Exam

Metrics:

Accuracy (Pass@1)
Majority Voting Accuracy (Maj@64)
GenSelect Accuracy (selecting best of N)
Statistical methodology: Not explicitly reported in the paper

Key Results

TIR and GenSelect significantly outperform baselines on the Comp-Math-24-25 benchmark.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Comp-Math-24-25	Accuracy	79.1	93.3	+14.2
Comp-Math-24-25	Accuracy	78.1	93.3	+15.2
Comp-Math-24-25	Accuracy	54.4	86.7	+32.3
Comp-Math-24-25	Accuracy	76.3	86.7	+10.4
Comp-Math-24-25	Accuracy	76.3	65.8	-10.5

Experiment Figures

Accuracy of CoT vs TIR models as a function of the number of samples used for Majority/GenSelect.

Main Takeaways

Generative Solution Selection (GenSelect) consistently outperforms Majority Voting, bridging the gap towards theoretical pass@k limits.
Tool-Integrated Reasoning (TIR) significantly boosts performance over Chain-of-Thought (CoT) for larger models, though smaller models (1.5B) struggle with tool consistency.
Linear merging of CoT and TIR checkpoints (CoT*0.3 + TIR*0.7) provided the best balance of accuracy and generation speed for the Kaggle submission.
Smaller models (7B) trained with the OpenMath recipe can outperform significantly larger models (DeepSeek-R1 671B) on specific competition math benchmarks.
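
The linear merge mentioned in the takeaways (CoT*0.3 + TIR*0.7) amounts to a per-parameter weighted average of two checkpoints with identical architecture. The sketch below illustrates this on plain Python lists. The function name `linear_merge` and the dict-of-lists checkpoint layout are assumptions for illustration; a real merge would operate on framework tensors.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened weights


def linear_merge(cot: Checkpoint, tir: Checkpoint, cot_weight: float = 0.3) -> Checkpoint:
    """Return cot_weight * cot + (1 - cot_weight) * tir, parameter by parameter."""
    if cot.keys() != tir.keys():
        raise ValueError("checkpoints must have identical parameter names")
    tir_weight = 1.0 - cot_weight
    merged: Checkpoint = {}
    for name in cot:
        a, b = cot[name], tir[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [cot_weight * x + tir_weight * y for x, y in zip(a, b)]
    return merged


# Toy usage: two tiny "checkpoints" with a single parameter each.
cot_ckpt = {"layer.weight": [1.0, 0.0]}
tir_ckpt = {"layer.weight": [0.0, 1.0]}
merged_ckpt = linear_merge(cot_ckpt, tir_ckpt)
print(merged_ckpt)
```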

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Supervised Fine-Tuning (SFT)
Rejection Sampling / Majority Voting
Model Merging

Key Terms

TIR: Tool-Integrated Reasoning—interleaving natural language thought with executable code blocks (e.g., Python) to solve sub-problems
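
A toy version of the interleaving described in the TIR definition above is sketched here: the model emits a code block, the code runs, and its output is appended to the context before the model continues. Everything in it is an assumption for illustration: the `<python>` and `<output>` markers, the names `tir_loop` and `run_python`, and the scripted stand-in for the model. The `exec` call is not sandboxed, so a real system would need isolated execution.

```python
import contextlib
import io
import re
from typing import Callable

# Hypothetical markers for code the model wants executed.
CODE_RE = re.compile(r"<python>(.*?)</python>", re.DOTALL)


def run_python(code: str) -> str:
    """Execute code and capture stdout. Not sandboxed: illustration only."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def tir_loop(problem: str, model_step: Callable[[str], str], max_rounds: int = 4) -> str:
    """Alternate model turns and code execution until no code is emitted."""
    transcript = problem
    for _ in range(max_rounds):
        turn = model_step(transcript)
        transcript += "\n" + turn
        match = CODE_RE.search(turn)
        if match is None:
            break  # The model produced a final answer without more code.
        transcript += "\n<output>" + run_python(match.group(1)) + "</output>"
    return transcript


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: first writes code, then reads its output."""
    if "<output>" not in transcript:
        return "Let me compute.\n<python>print(sum(i * i for i in range(1, 11)))</python>"
    value = transcript.rsplit("<output>", 1)[1].split("</output>")[0]
    return f"The answer is {value}."


result = tir_loop("Sum the squares of 1..10.", scripted_model)
print(result)
```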

CoT: Chain-of-Thought—a prompting method where the model generates intermediate reasoning steps before the final answer

GenSelect: Generative Solution Selection—a method where a model is presented with multiple candidate solution summaries and generates a reasoning trace to select the best one

RoPE: Rotary Positional Embeddings—a technique for encoding position information in Transformers; here, the base frequency is scaled to support longer context windows

Pass@k: The probability that at least one of the k generated solutions is correct

Maj@k: The accuracy obtained by taking the most frequent answer (majority vote) among k generated solutions
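
Both metrics defined above can be computed from sampled answers, as in this minimal sketch. For Pass@k it uses the unbiased estimator common in code-generation evaluation, 1 - C(n-c, k)/C(n, k) for n samples of which c are correct; whether the paper uses this exact estimator is not stated in the summary. The function names are illustrative.

```python
from collections import Counter
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples with c correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(answers: Sequence[str], reference: str) -> bool:
    """True if the most frequent of the k answers matches the reference
    (ties go to the answer that appeared first)."""
    return Counter(answers).most_common(1)[0][0] == reference


# Toy usage: 4 samples for one problem, 1 of them correct.
print(pass_at_k(n=4, c=1, k=1))
print(maj_at_k(["7", "7", "3", "7"], reference="7"))
```

Averaging either function over all problems in a benchmark gives the dataset-level score.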

SFT: Supervised Fine-Tuning—training a pre-trained model on a labeled dataset

Speculative Decoding: An inference acceleration technique where a small 'drafter' model proposes tokens that are verified by the larger target model

Model Merging: Combining the weights of two different fine-tuned models (e.g., via linear interpolation) to blend their capabilities