GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of samples for the same input, eliminating the need for a critic model
SFT: Supervised Fine-Tuning—training the model on labeled (prompt, answer) pairs, typically before applying RL
On-policy: RL training where the data used for updates is generated by the current version of the model, so the policy-gradient estimate remains unbiased (unlike off-policy updates computed on stale samples)
Entropy collapse: A failure mode in RL where the model's policy becomes deterministic too quickly, losing exploration capabilities
CoT: Chain-of-Thought—a prompting or generation strategy where the model produces intermediate reasoning steps before the final answer
LiveCodeBench: A benchmark for evaluating code generation models on competitive programming problems, often using fresh problems to avoid contamination
AIME: American Invitational Mathematics Examination—a challenging math competition used as a benchmark for reasoning models
REINFORCE: A fundamental policy gradient algorithm in RL that updates model weights to maximize expected reward
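The GRPO entry above can be made concrete with a small sketch of its advantage estimate: sample a group of completions for one prompt, then normalize each completion's reward by the group's mean and standard deviation. This is a minimal illustration of the normalization step only (function name and epsilon are my own), not the full GRPO objective.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center and scale each reward by the
    group's mean and std, so no separate critic model is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. binary rewards for 4 sampled completions of the same prompt
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that beat the group average get positive advantages and are reinforced; below-average ones get negative advantages, and the advantages sum to zero within each group.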
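The REINFORCE entry can likewise be sketched on a toy problem: a 2-armed bandit where the policy is a single logit and each update moves the weights along reward-weighted score-function gradients, ∇θ log π(a) · r. This is a hypothetical minimal example (all names and constants are my own), not how the algorithm would be wired into an LLM trainer.

```python
import math
import random

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a 2-armed bandit: arm 1 pays reward 1, arm 0 pays 0.
    Policy: pi(1) = sigmoid(theta). Update: theta += lr * r * d/dtheta log pi(a)."""
    random.seed(seed)
    theta = 0.0
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))   # probability of choosing arm 1
        a = 1 if random.random() < p1 else 0  # sample an action from the policy
        r = 1.0 if a == 1 else 0.0            # reward from the environment
        grad_logp = a - p1                    # d/dtheta log pi(a) for a Bernoulli policy
        theta += lr * r * grad_logp           # ascend the expected reward
    return 1.0 / (1.0 + math.exp(-theta))     # final P(choose the rewarding arm)
```

Because only arm 1 is ever rewarded, the final probability of choosing it approaches 1; the same reward-weighted log-probability gradient is what GRPO applies with group-normalized advantages in place of raw rewards.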