Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

📝 Paper Summary

Self-evolving, Agentic reasoning, RL-based tool use, Mathematical reasoning

ZeroTIR trains base LLMs to spontaneously learn code execution for math via outcome-based reinforcement learning, revealing scaling laws where accuracy, response length, and tool usage increase predictably with training steps.

Core Problem

LLMs struggle with precise math calculations; Supervised Fine-Tuning (SFT) limits exploration to specific patterns, while existing Tool-Integrated Reasoning (TIR) relies on rigid prompts rather than spontaneous, learned tool use.

Why it matters:

Next-token prediction often hallucinates calculations, whereas code execution provides deterministic correctness
Reliance on SFT trajectory data is expensive and constrains the model's ability to discover novel problem-solving strategies
Existing ZeroRL approaches often ignore external tools, missing the potential for agents to offload computation autonomously

Concrete Example: When asked to solve a complex equation, a standard LLM might hallucinate an incorrect arithmetic step. In contrast, the ZTRL agent (the model trained with ZeroTIR) spontaneously generates a Python script to calculate the roots numerically, executes it, and uses the output to derive the correct answer.

Key Novelty

Agent RL Scaling Law & ZeroTIR Framework

Demonstrates a quantifiable 'Scaling Law' in Agent RL: as training steps increase, the model spontaneously increases code execution frequency and response length, correlating strongly with accuracy
Introduces ZeroTIR: a framework to train general base models (not math-specialized) to use code interpreters from scratch using only outcome-based rewards, without supervised tool-use examples

Architecture

The state-machine interaction mechanism for spontaneous code execution during RL rollouts.

Evaluation Highlights

7B ZTRL model achieves 54.0% average accuracy on AIME24, AIME25, and MATH500, outperforming the math-specialized TORL baseline (51.8%)
Surpasses SFT-based Qwen 2.5 Math Instruct (with TIR) by +10.7 percentage points (52.3% vs 41.6%) on the aggregated math benchmark
Increasing the tool interaction cap from 0 to 4 boosts average performance by up to ~15 percentage points, validating the benefit of tool use

Breakthrough Assessment

8/10

Strong empirical evidence for spontaneous tool emergence via pure RL (ZeroRL) on base models. The identification of scaling laws for agentic behaviors is a significant contribution to understanding agent training dynamics.

⚙️ Technical Details

Problem Definition

Setting: Mathematical problem solving where an agent generates a trajectory y (reasoning + code) for input x to maximize outcome-based reward R(x, y)

Inputs: Natural language math problem x

Outputs: Final answer extracted from the trajectory y, which interleaves text generation and code execution
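The outcome-based reward R(x, y) can be illustrated with a minimal sketch. It assumes the final answer is wrapped in \boxed{...} and is compared with the reference by exact string match after whitespace removal. Both the extraction rule and the function names (extract_final_answer, outcome_reward) are illustrative assumptions, not the paper's verifier.

```python
import re
from typing import Optional


def extract_final_answer(trajectory: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the trajectory, if any.

    Assumption: answers are boxed and contain no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", trajectory)
    return matches[-1].strip() if matches else None


def _normalize(text: str) -> str:
    """Remove all whitespace so that '1 / 2' and '1/2' compare equal."""
    return "".join(text.split())


def outcome_reward(trajectory: str, reference: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches, else 0.0."""
    answer = extract_final_answer(trajectory)
    if answer is None:
        return 0.0  # no parsable final answer counts as incorrect
    return 1.0 if _normalize(answer) == _normalize(reference) else 0.0
```

A real verifier would likely need symbolic or numeric equivalence checks; string matching is used here only to keep the sketch short.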

Pipeline Flow

LLM Agent (Reasoning & Code Gen)
Stop Token Detection
External Code Environment
Context Integration

System Modules

LLM Agent

Generates reasoning steps and Python code blocks enclosed in specific tags

Model or implementation: Qwen 2.5 Base (7B or 32B)

Stop Token Detector

Pauses generation when a dynamic stop token (e.g., ```python) is detected

Model or implementation: Rule-based logic

External Code Environment

Executes the generated Python code in a decoupled, sandboxed environment

Model or implementation: Python Interpreter (Service)

Context Integrator

Appends execution output to the context (as 'Tool Output') and triggers the LLM to resume generation

Model or implementation: Rule-based logic
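The four modules above can be sketched as a single rollout loop. This is a toy illustration under stated assumptions: generate is a hypothetical callable that returns the next text segment and stops right after a completed code block, run_python uses exec() as a stand-in for the decoupled sandbox service, and the names FENCE, CODE_BLOCK, and rollout are invented for this example. The default cap of 4 is illustrative.

```python
import contextlib
import io
import re
from typing import Callable

FENCE = "`" * 3  # Markdown code-fence marker
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_python(code: str) -> str:
    """Toy stand-in for the external environment: run code, capture stdout.

    Assumption: a real system calls an isolated service instead of exec().
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue()


def rollout(generate: Callable[[str], str], prompt: str, n_max: int = 4) -> str:
    """Alternate generation and execution until no code appears or the cap is hit."""
    context = prompt
    tool_calls = 0
    while True:
        segment = generate(context)             # LLM agent: reasoning and code
        context += segment
        match = CODE_BLOCK.search(segment)      # stop-token detection (simplified)
        if match is None or tool_calls >= n_max:
            return context                      # final answer or cap reached
        output = run_python(match.group(1))     # external code environment
        context += f"\nTool Output:\n{output}\n"  # context integration
        tool_calls += 1
```

Setting n_max to 0 disables execution entirely, which corresponds to the pure-text end of the interaction-cap ablation.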

Novel Architectural Elements

Decoupled asynchronous code execution service integrated into OpenRLHF training pipeline
Replay buffer filtering mechanism based on group accuracy (discarding sample groups whose accuracy is very high or very low) to stabilize ZeroRL
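The group-accuracy filter can be sketched as follows. The thresholds low and high, the dictionary layout, and the function name are assumptions for illustration; the summary does not state the actual cutoffs. The intuition is that a group in which every sample is correct, or every sample is wrong, offers little contrast between samples for the policy update.

```python
from typing import Dict, List


def filter_groups_by_accuracy(
    groups: Dict[str, List[float]],
    low: float = 0.0,
    high: float = 1.0,
) -> Dict[str, List[float]]:
    """Keep prompts whose mean reward lies strictly between `low` and `high`.

    `groups` maps a prompt id to the binary rewards of its sampled responses.
    """
    kept: Dict[str, List[float]] = {}
    for prompt_id, rewards in groups.items():
        if not rewards:
            continue  # nothing sampled for this prompt
        accuracy = sum(rewards) / len(rewards)
        if low < accuracy < high:
            kept[prompt_id] = rewards
    return kept
```

With the default thresholds, only all-correct and all-wrong groups are dropped; tightening them removes near-saturated groups as well.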

Modeling

Base Model: Qwen 2.5 Base (7B and 32B parameters)

Training Method: Reinforcement Learning (PPO and Reinforce++)

Objective Functions:

Purpose: Maximize expected outcome-based reward while staying close to reference model.

Formally: J(θ) = E[R(x, y) - β * KL(π_θ || π_ref)]
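As a rough illustration of this objective, the sketch below forms a Monte Carlo estimate of J(θ) from sampled trajectories. The KL term is approximated by the sum of per-token log-probability ratios along a trajectory sampled from the current policy, a common single-sample estimator. The value of beta, the dictionary keys, and the function names are assumptions, not values from the paper.

```python
from typing import List, Sequence


def kl_penalized_return(
    reward: float,
    logprobs_policy: Sequence[float],
    logprobs_ref: Sequence[float],
    beta: float = 0.01,
) -> float:
    """Single-sample estimate of R(x, y) - beta * KL(pi_theta || pi_ref)."""
    if len(logprobs_policy) != len(logprobs_ref):
        raise ValueError("log-prob sequences must have equal length")
    # Sum of log(pi_theta / pi_ref) over the tokens of the sampled trajectory.
    kl_estimate = sum(p - r for p, r in zip(logprobs_policy, logprobs_ref))
    return reward - beta * kl_estimate


def estimate_objective(samples: List[dict], beta: float = 0.01) -> float:
    """Average the penalized return over sampled trajectories."""
    values = [
        kl_penalized_return(s["reward"], s["logp"], s["logp_ref"], beta)
        for s in samples
    ]
    return sum(values) / len(values)
```

A larger beta keeps the policy closer to the reference model at the cost of slower reward improvement.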

Adaptation: Full model update (implied by RL on base model)

Training Data:

ORZ-57k dataset
DeepMath dataset (verifiable math problems)

Key Hyperparameters:

rollout_batch_size: 128
samples_per_prompt: 16
policy_update_steps: 1
critic_update_steps: 12 (for PPO)
micro_batch_size: 1
max_tool_calls_N_max: 20 (limit used during training/eval)

Compute: Asynchronous pipeline reported to be 1.6x faster than basic async rollout and 4x faster than synchronous interaction. Specific GPU hours not reported.

Comparison to Prior Work

vs. TORL: ZTRL trains on a general base model (Qwen Base) rather than a math-specialized base (Qwen Math Base) yet achieves higher accuracy (54.0% vs 51.8% on AIME/MATH)
vs. Qwen 2.5 Math Instruct: ZTRL learns tool use spontaneously via RL, outperforming the SFT-based tool use of the Instruct model
vs. SimpleRL-Zero: ZTRL integrates code execution, showing significantly higher performance on math tasks compared to pure text reasoning

Limitations

Computational cost of RL training is high, especially with multiple tool execution rollouts
Large models (32B) benefit less from high interaction caps compared to smaller models, suggesting diminishing returns
Analysis is limited to mathematical reasoning tasks; generalization to other agentic domains (e.g., web browsing) is not tested
Value function estimation in PPO requires careful masking of environment-generated tokens
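The masking mentioned in the last point can be sketched as follows: tokens inserted by the environment (tool output) receive zero weight, so only model-generated tokens contribute to the loss. The segment labels "model" and "tool" and both function names are illustrative assumptions; this is not the paper's implementation.

```python
from typing import List, Tuple


def build_mask(segments: List[Tuple[int, str]]) -> List[bool]:
    """Build a token mask from (num_tokens, source) segments.

    `source` is "model" for generated tokens and "tool" for tool output.
    """
    mask: List[bool] = []
    for num_tokens, source in segments:
        mask.extend([source == "model"] * num_tokens)
    return mask


def masked_mean_loss(per_token_loss: List[float], is_model_token: List[bool]) -> float:
    """Average the per-token loss over model-generated tokens only."""
    if len(per_token_loss) != len(is_model_token):
        raise ValueError("loss and mask must have equal length")
    kept = [loss for loss, keep in zip(per_token_loss, is_model_token) if keep]
    if not kept:
        return 0.0  # no trainable tokens in this sequence
    return sum(kept) / len(kept)
```

The same mask would apply to both the policy loss and the value loss, since the model did not choose the tool-output tokens.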

Reproducibility

Code: https://github.com/yyht/openrlhf_async_pipline

Training datasets (ORZ-57k, DeepMath) and base models (Qwen 2.5) are public. RL hyperparameters are detailed in the paper.

📊 Experiments & Results

Evaluation Setup

Mathematical problem solving with verifiable answers using Python code execution

Benchmarks:

MATH500 (Challenging math problems)
AIME 2024 / 2025 (American Invitational Mathematics Examination)
HMMT Feb 2024 / 2025 (Harvard-MIT Mathematics Tournament)
CMIMC (Carnegie Mellon Informatics and Mathematics Competition)

Metrics:

Accuracy (Greedy)
Pass@1
Majority Voting (Maj@k)
Statistical methodology: Not explicitly reported in the paper
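Majority voting (Maj@k) can be illustrated with a short sketch: for each problem, the most frequent of k sampled answers is taken as the prediction. The tie-breaking rule (earliest answer wins) and the function names are assumptions; the summary does not specify how ties or unparsable answers are handled.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent parsed answer; ties go to the earliest seen."""
    valid = [a for a in answers if a is not None]  # drop unparsable samples
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def maj_at_k(sampled_answers: List[List[Optional[str]]], references: List[str]) -> float:
    """Fraction of problems whose majority answer over k samples is correct."""
    correct = sum(
        majority_vote(answers) == ref
        for answers, ref in zip(sampled_answers, references)
    )
    return correct / len(references)
```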

Key Results

Comparative analysis of ZTRL (ZeroTIR) against SFT and other RL baselines on core math benchmarks using the Qwen 2.5 7B backbone.
Benchmark	Metric	Baseline	This Paper	Δ
AIME24 + AIME25 + MATH500 (Avg)	Average Accuracy	41.6 (Qwen 2.5 Math Instruct, TIR)	52.3	+10.7
AIME24 + AIME25 + MATH500 (Avg)	Average Accuracy	51.8 (TORL)	54.0	+2.2
MATH500	Accuracy	45.0	62.8	+17.8

Ablation study on the effect of increasing the maximum allowed tool calls (N_max) on model accuracy.
Benchmark	Metric	Baseline	This Paper	Δ
AIME 2024	Accuracy	33.3	50.0	+16.7
MATH500	Accuracy	43.4	61.0	+17.6

Experiment Figures

Evolution of metrics (Code Proportion, Length, Accuracy) over training steps.

Main Takeaways

Agent RL Scaling Law exists: Training steps positively correlate with code execution frequency, response length, and final accuracy.
Spontaneous Tool Use: Base models can learn to use tools effectively from scratch (ZeroTIR) without SFT, reaching ~90% code usage ratio.
General vs. Specialized Base: ZTRL on a general base model outperforms concurrent methods (TORL) trained on math-specialized base models.
Diminishing Returns on Interaction: While N_max=4 yields massive gains over N_max=0, increasing to N_max=20 offers marginal or no improvement for larger models.
Reinforce++ vs PPO: Reinforce++ converges faster (approx 300 steps earlier) than PPO to similar optimal performance levels.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, REINFORCE)
Large Language Models (LLMs)
Code Interpreters / Sandboxed Execution

Key Terms

ZeroTIR: Tool-Integrated Reasoning learned in the ZeroRL setting, i.e., training a base model to use tools via RL without supervised tool-use examples

ZeroRL: Reinforcement Learning applied directly to base models (without SFT) using simple outcome-based rewards

SFT: Supervised Fine-Tuning—training models on labeled examples of inputs and desired outputs

PPO: Proximal Policy Optimization—an RL algorithm that updates policies with a clipped objective to ensure stability

Reinforce++: A variant of the REINFORCE algorithm that improves stability and performance for LLM reasoning tasks

GAE: Generalized Advantage Estimation—a method to estimate the advantage of an action by balancing bias and variance

Outcome-based reward: A binary reward signal given only at the end of a task (1 for correct answer, 0 for incorrect), as opposed to step-by-step process rewards

Pass@k: The probability that at least one of k sampled solutions is correct
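Pass@k is often estimated from n samples per problem, of which c are correct, using the combinatorial estimator 1 - C(n - c, k) / C(n, k). The sketch below implements that estimator; whether the paper computes Pass@1 this way is not stated in the summary, so treat it as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to c / n, the fraction of correct samples.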