AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
NVIDIA
arXiv.org (2025)
Tags: RL · Reasoning · Benchmark

📝 Paper Summary

Topics: Reasoning LLMs · Reinforcement Learning (RL) for LLMs
AceReason demonstrates that large-scale reinforcement learning with a specific math-then-code curriculum significantly improves reasoning in small and mid-sized distilled models, challenging the belief that RL is only effective for huge models.
Core Problem
Training recipes for high-performing reasoning models remain elusive, and prevailing reports suggest that for smaller models (<32B) RL is ineffective compared to distillation.
Why it matters:
  • Frontier model details (e.g., DeepSeek-R1) are often omitted, hindering reproduction
  • Small/mid-sized models are critical for efficient deployment but have historically struggled to gain from RL
  • Domain-specific tuning often leads to catastrophic forgetting (e.g., learning code degrades math skills)
Concrete Example: When DeepSeek-R1-Distill-Qwen-7B is trained further, standard domain-specific SFT often degrades performance in other domains. The paper shows that without the proposed curriculum, training on code can harm math accuracy, whereas their approach boosts both (+14.6% Math, +6.8% Code for 7B).
Key Novelty
Math-to-Code Sequential RL Curriculum
  • Trains the model first on math-only prompts (whose verifiable rewards are quick to check), then on code-only prompts, exploiting the finding that math-only RL already boosts code reasoning
  • Uses strictly on-policy GRPO with no KL-divergence penalty, which maintains training stability and avoids entropy collapse while requiring no separate value model
  • Implements a stage-wise length extension curriculum (8K → 16K → 24K → 32K tokens) to efficiently scale reasoning depth
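The two training ingredients above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, rewards are assumed to be scalar verifier outcomes per rollout, and only the group-normalized advantage and the stage schedule are shown.

```python
from statistics import mean, pstdev

# Token-length caps for the stage-wise extension curriculum
# reported in the summary (8K -> 16K -> 24K -> 32K).
LENGTH_STAGES = [8192, 16384, 24576, 32768]

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages as used by GRPO.

    For a group of rollouts sampled from the same prompt, each
    response's advantage is its reward standardized against the
    group -- no learned value model is needed, and (per the recipe
    summarized above) no KL penalty is added to the loss.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_loss(seq_logprobs, rewards):
    """On-policy GRPO objective for one group (hypothetical sketch).

    With strictly on-policy sampling the importance ratio is 1, so
    the per-response loss reduces to -advantage * sequence log-prob.
    """
    advs = grpo_advantages(rewards)
    return -mean(a * lp for a, lp in zip(advs, seq_logprobs))
```

In practice, each stage in `LENGTH_STAGES` would cap the generation length during rollouts; moving to the next stage lets the model learn progressively deeper reasoning chains without paying the full 32K-token cost from the start.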
Evaluation Highlights
  • +17.2% improvement on AIME 2025 math benchmark for the 14B model using Math-only RL
  • +6.8% improvement on LiveCodeBench v5 for the 7B model using Math-only RL (demonstrating cross-domain transfer)
  • Final 14B model achieves 58.9% on LiveCodeBench, outperforming the specialized DeepCoder-14B (57.9%)
Breakthrough Assessment
8/10
Successfully debunks the myth that RL is ineffective for small distilled models. Provides a clear, high-performing recipe with strong cross-domain transfer results.