GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to reduce variance, avoiding the need for a separate value network.
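The group-relative normalization can be sketched in a few lines. This is an illustrative helper (the function name and the choice of population standard deviation are assumptions, not taken from the glossary): each sampled output's advantage is its reward standardized against the other outputs for the same prompt, so no learned value network is required.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within one prompt's group of sampled outputs.

    Advantage = (reward - group mean) / group std. A degenerate group
    (all rewards equal) yields zero advantages for every sample.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```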
SFT: Supervised Fine-Tuning—training a model on labeled examples (prompt-response pairs) to instill desired behaviors before RL.
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
pass@K: A metric measuring the probability that at least one of K generated solutions is correct.
temperature-adjusted entropy: A metric used to monitor the randomness of the model's policy during RL, calculated as entropy divided by the sampling temperature.
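A minimal sketch of this calculation, under the assumption that the entropy is Shannon entropy of the temperature-scaled sampling distribution (the glossary only says "entropy divided by the sampling temperature", so that detail is an assumption):

```python
import math

def temperature_adjusted_entropy(logits, temperature):
    """Entropy of the temperature-scaled softmax over `logits`,
    divided by the sampling temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / temperature
```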
overlong filtering: A strategy during RL where samples that fail to produce a final answer within the token budget are masked out (ignored) rather than penalized.
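The masking step might look like the following sketch, where the sample fields (`num_tokens`, `answer`, `loss_mask`) are hypothetical names chosen for illustration: truncated samples with no final answer get a zero loss mask instead of a negative reward.

```python
def filter_overlong(samples, token_budget):
    """Mask out samples that hit the token budget without producing
    a final answer, so they are ignored rather than penalized."""
    for s in samples:
        truncated_without_answer = (
            s["num_tokens"] >= token_budget and s["answer"] is None
        )
        # 0.0 excludes the sample from the RL loss; 1.0 keeps it.
        s["loss_mask"] = 0.0 if truncated_without_answer else 1.0
    return samples
```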
exposure bias: The discrepancy between training (where the model conditions on ground-truth tokens) and inference (where it conditions on its own predictions); the mismatch compounds as generated sequences grow longer.
on-policy: RL training where the data used for updates is generated by the current version of the policy being optimized.
rope_theta: A parameter in RoPE (Rotary Positional Embeddings) that controls the wavelength of position encodings; increasing it allows models to handle longer context windows.
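The wavelength effect can be made concrete with a short sketch (the helper name is an assumption; the formula is the standard RoPE parameterization, where frequency i uses theta_i = rope_theta^(-2i/d) and hence wavelength 2*pi*rope_theta^(2i/d)):

```python
import math

def rope_wavelengths(head_dim, rope_theta):
    """Wavelengths (in token positions) of each RoPE frequency pair.

    Raising rope_theta stretches the slowest-rotating dimensions,
    which is why it helps models handle longer context windows.
    """
    return [
        2 * math.pi * rope_theta ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]
```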
DeepSeek-R1: A frontier reasoning model used here as a teacher to generate synthetic SFT data.
distillation: Training a smaller student model to mimic the outputs of a larger teacher model.