rStar2-Agent: Agentic Reasoning Technical Report

📝 Paper Summary

Agentic reinforcement learning Math reasoning Code generation for reasoning

rStar2-Agent trains a 14B model to achieve frontier-level math reasoning by using a novel reinforcement learning strategy that filters noisy code execution feedback and prioritizes high-quality, correct reasoning trajectories.

Core Problem

Standard RL for reasoning fails in coding environments because outcome-only rewards validate trajectories even when intermediate code steps are broken or noisy, teaching models that errors are acceptable.

Why it matters:

Current reasoning models rely on 'thinking longer' (Long CoT) but struggle with tasks requiring external verification or creative shifts
Outcome-only rewards in noisy environments lead to reward hacking, where models produce lengthy, low-quality trajectories with many tool errors
Scaling agentic RL is computationally expensive due to the high cost of rollouts and concurrent tool execution environments

Concrete Example: A model might write incorrect Python code to solve a math problem, receive an error message, try again, and eventually guess the right answer. Standard RL rewards this entire messy trajectory as 'correct' (reward=1), reinforcing the behavior of writing buggy code and wasting tokens on error correction.

Key Novelty

Group Relative Policy Optimization with Resample-on-Correct (GRPO-RoC)

Oversamples rollout trajectories and applies asymmetric filtering: keeps all failure modes to learn what to avoid, but aggressively filters successful trajectories to keep only those with clean code and minimal errors
Introduces a specialized infrastructure that balances rollout requests based on GPU KV cache capacity, enabling massive-scale training (45K concurrent tool calls) on limited hardware

Architecture

The multi-turn agentic rollout process where the model interacts with a Python environment

Evaluation Highlights

Achieves 80.6% pass@1 on AIME 2024, outperforming OpenAI o3-mini (medium) and DeepSeek-R1 (671B)
Surpasses DeepSeek-R1 on AIME 2025 with 69.8% accuracy while using a significantly smaller 14B model
Reached state-of-the-art performance in only 510 RL steps over one week using 64 MI300X GPUs

Breakthrough Assessment

9/10

Demonstrates that small (14B) models can beat massive frontier models (671B) on hard math benchmarks through specialized agentic RL, effectively solving the noisy-reward problem in tool-use training.

⚙️ Technical Details

Problem Definition

Setting: Math problem solving using Python code generation and execution as an external tool

Inputs: Math word problem q

Outputs: Final numerical answer verified against ground truth

Pipeline Flow

Input Question
Reasoning Generation (Assistant Role)
Tool Call Extraction (JSON)
Environment Execution (Python Interpreter)
Feedback Integration (User Role)
Reasoning Continuation
Final Answer Extraction

System Modules

Reasoning Agent

Generates natural language reasoning and structured JSON tool calls

Model or implementation: Qwen2.5-Math-7B or Qwen2.5-32B-Instruct (base for initialization)

Code Environment

Executes Python code and returns standard output, error logs, or timeout signals

Model or implementation: Python Interpreter with Numpy, Scipy, SymPy

Novel Architectural Elements

Resample-on-Correct (RoC) rollout sampler: An asymmetric sampling layer between environment rollouts and policy updates that aggressively filters positive trajectories based on tool error rates and formatting penalties

Modeling

Base Model: Qwen2.5-Math-7B, Qwen2.5-32B-Instruct, initialized as rStar2-Agent-14B

Training Method: Group Relative Policy Optimization with Resample-on-Correct (GRPO-RoC)

Objective Functions:

Purpose: Maximize probability of high-quality trajectories relative to group average.

Formally: standard GRPO objective maximizing advantage A_{i,t} based on binary rewards r_i.
Purpose: Penalize low-quality successful trajectories during sampling.

Formally: Sampling probability P ~ 1 / (p_err + p_format), where p_err is tool error ratio and p_format is formatting violation count.

Adaptation: Full model training

Training Data:

NuminaMath, K-Level math problems for SFT
Math datasets for RL (AIME, AMC, etc.)

Key Hyperparameters:

clip_epsilon_low: 0.2
clip_epsilon_high: 0.28 (Clip-Higher strategy)
beta_kl: Not explicitly reported
+ 2 more
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: 64x MI300X GPUs, training completed in 1 week

Comparison to Prior Work

vs. DeepSeek-R1: Achieves similar/better performance at 14B scale (vs 671B) by specializing in agentic tool use rather than just long CoT
vs. Qian et al. (2025): Uses sampling strategies (RoC) to filter noise instead of modifying the reward function, avoiding reward hacking and complexity
vs. Standard GRPO: Adds asymmetric resampling to handle the high noise inherent in code generation environments

Limitations

Relies on verifiable rewards (math answers), limiting applicability to open-ended tasks
Requires a high-throughput code execution environment which adds infrastructure complexity
Specific hyperparameters like learning rate and batch size are not detailed in the text

Reproducibility

Code: https://github.com/microsoft/rStar

Code and training recipes available at https://github.com/microsoft/rStar. Exact learning rates and batch sizes are not explicitly listed in the main text.

📊 Experiments & Results

Evaluation Setup

Math reasoning benchmarks requiring complex problem solving and tool use

Benchmarks:

AIME 2024 (Challenging Math Competition)
AIME 2025 (Challenging Math Competition)
HMMT 2025 (Challenging Math Competition)
GPQA-Diamond (Scientific Reasoning)
BFCL v3 (Agentic Tool Use)

Metrics:

Pass@1 Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on high-difficulty math competitions demonstrates rStar2-Agent's ability to surpass much larger frontier models.
AIME 2024	Pass@1	79.8	80.6	+0.8
AIME 2025	Pass@1	64.3	69.8	+5.5
HMMT 2025	Pass@1	51.8	52.7	+0.9
Generalization capabilities beyond pure math show the method improves broader reasoning and tool use.
GPQA-Diamond	Accuracy	59.1	68.9	+9.8
BFCL v3	Accuracy	44.6	59.1	+14.5

Experiment Figures

Comparison of tool error rates in positively rewarded trajectories between naive GRPO and GRPO-RoC

Correlation between tool call errors and reasoning performance/response length

Main Takeaways

Small models (14B) can outperform massive reasoning models (671B) when equipped with agentic RL and code interpreters
Filtering positive trajectories based on code quality (RoC) is more effective than simple outcome-based rewards for tool-use training
The model generalizes well to scientific reasoning and general tool use despite being trained primarily on math problems
Agentic RL incentivizes advanced cognitive behaviors like reflection on error messages without explicit supervision

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically PPO/GRPO)
Chain-of-Thought (CoT) reasoning
Python tool use in LLMs

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes a policy based on the relative performance of a group of outputs for the same input, often without a separate value network

RoC: Resample-on-Correct—a strategy that oversamples trajectories and filters the positive ones to retain only high-quality traces (few errors/formatting issues) for training

outcome-only reward: A reward signal given solely based on whether the final answer is correct, ignoring the quality of intermediate steps

KV cache: Key-Value cache—memory used by LLMs to store attention mechanism computations, optimizing generation speed but consuming GPU memory

SFT: Supervised Fine-Tuning—training a model on labeled examples to establish basic capabilities before reinforcement learning

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

reward hacking: When an RL agent finds a way to maximize the reward signal (e.g., getting the right answer) using undesirable behaviors (e.g., guessing or writing messy code)

rollout: The process of generating a complete sequence of actions (tokens and tool calls) from the policy during RL training