ToRL: Scaling Tool-Integrated RL

📝 Paper Summary

Agentic AI, Reinforcement Learning (RL) for Reasoning

ToRL scales reinforcement learning directly on base models to autonomously discover optimal tool-use strategies, bypassing the limitations of supervised fine-tuning on predetermined trajectories.

Core Problem

Existing Tool-Integrated Reasoning (TIR) methods rely on Supervised Fine-Tuning (SFT) from distilled trajectories, which restricts models to imitating fixed patterns and prevents exploration of optimal strategies.

Why it matters:

Pure language models struggle with complex calculations and precise equation solving compared to code-augmented models
Imitation-based SFT limits models to human-designed or distilled tool usage patterns, hindering the emergence of novel or more efficient reasoning paths
Prior RL approaches typically start from SFT-aligned models, obscuring whether tool capabilities can emerge from scratch through pure reward signals

Concrete Example: When solving a complex math problem, a standard SFT model might blindly follow a fixed 'reason-then-code' pattern it was trained on. In contrast, a ToRL model might attempt code, fail with a syntax error, self-correct based on the error message, and then switch to analytical reasoning, a behavior learned dynamically through exploration.

Key Novelty

Tool-Integrated Reinforcement Learning (ToRL) from Base Models

Applies reinforcement learning directly to base models (without prior instruction tuning) with a code interpreter integrated into the interaction loop
Allows the model to learn tool invocation, code generation, and self-correction solely through outcome-based rewards (correct/incorrect answer) rather than imitating demonstrations

Architecture

[Figure: Conceptual flow of the Tool-Integrated Reasoning (TIR) rollout process.]

Evaluation Highlights

ToRL-7B achieves 43.3% accuracy on AIME24, surpassing the best existing Tool-Integrated Reasoning (TIR) model by ~17 percentage points
Outperforms standard RL training without tool integration by ~14 percentage points on AIME24
ToRL-1.5B achieves 48.5% average accuracy across benchmarks, beating Qwen2.5-Math-1.5B-Instruct-TIR (41.3%)

Breakthrough Assessment

8/10

Strong empirical evidence that RL can induce complex tool-use behaviors from scratch in base models, significantly outperforming SFT-based baselines. The emergence of self-correction without explicit instruction is a key finding.

⚙️ Technical Details

Problem Definition

Setting: Mathematical problem solving where a language model M interacts with a code interpreter I to generate a reasoning trajectory s_k containing text, code, and execution results.

Inputs: Input question Q

Outputs: Final Answer (after iterative text generation and code execution)

Pipeline Flow

LLM Generation (Text + Code)
Stop Check (Wait for code block termination)
Code Execution (Sandbox Fusion)
Result Injection (Observation)
Continued Generation (Loop until final answer)
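
The five steps above can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the `<code>`/`<output>` delimiters and the names `rollout`, `run_python`, and `toy_policy` are hypothetical, a scripted function stands in for the LLM, and an in-process `exec` stands in for Sandbox Fusion (it provides no isolation and should not be used on untrusted code).

```python
import contextlib
import io
import re

# Hypothetical delimiters for model-written code; the real prompt format may differ.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_python(code: str) -> str:
    """Stand-in for the sandbox: run code in-process, return stdout or the error."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def rollout(generate, question: str, max_tool_calls: int = 1) -> str:
    """Alternate generation and code execution until no further code is run."""
    trajectory = question
    tool_calls = 0
    while True:
        segment = generate(trajectory)          # LLM generation (text + code)
        trajectory += segment
        match = CODE_RE.search(segment)         # stop check: was a code block closed?
        # Assumption: this sketch simply stops once the tool-call budget is spent.
        if match is None or tool_calls >= max_tool_calls:
            return trajectory
        tool_calls += 1
        observation = run_python(match.group(1))            # code execution
        trajectory += f"\n<output>{observation}</output>\n"  # result injection


def toy_policy(context: str) -> str:
    """Scripted stand-in for the policy model."""
    if "<output>" not in context:
        return "\nLet me compute this.\n<code>print(sum(range(1, 101)))</code>"
    result = re.search(r"<output>(.*?)</output>", context, re.DOTALL).group(1)
    return f"The answer is {result}."
```

Calling `rollout(toy_policy, "What is 1 + 2 + ... + 100?")` yields a trajectory in which the executed result is injected as an observation before the final answer. Because errors are returned as text rather than raised, a policy can read the error message and retry, which is the kind of self-correction the summary describes.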

System Modules

Policy Model

Generate reasoning text and code blocks

Model or implementation: Qwen2.5-Math-Base (1.5B and 7B)

Code Interpreter

Execute Python code generated by the model

Model or implementation: Sandbox Fusion

Reward System

Evaluate final answer correctness

Model or implementation: Rule-based
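
A rule-based reward of this kind can be as simple as extracting the final answer and comparing it with the reference. The sketch below is illustrative only: it assumes the final answer appears in a `\boxed{...}` expression, uses exact string matching, and picks reward values of +1 and -1; none of these details are stated in this summary, and the function names are hypothetical.

```python
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 0
    for i in range(begin, len(response)):
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return response[begin:i].strip()
            depth -= 1
    return None  # unbalanced braces: treat as no answer


def outcome_reward(response: str, reference: str) -> float:
    """Assumed reward values: +1.0 for an exact match, -1.0 otherwise."""
    answer = extract_final_answer(response)
    return 1.0 if answer is not None and answer == reference.strip() else -1.0
```

Exact string matching is the simplest possible rule; a practical checker would likely also normalize equivalent forms such as `0.5` and `\frac{1}{2}`.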

Novel Architectural Elements

Integration of a code interpreter directly into the reinforcement learning rollout loop for base models, allowing dynamic trajectory generation during training rather than replay of static SFT data

Modeling

Base Model: Qwen2.5-Math (1.5B and 7B base versions)

Training Method: Reinforcement Learning (GRPO algorithm)

Objective Functions:

Purpose: Maximize expected reward for correct answers.
Formally: GRPO objective (Group Relative Policy Optimization).

Purpose: Penalize execution failures (ablation only).
Formally: -0.5 reward for non-executable code (found ineffective in main results).
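
To make the two objectives concrete, the sketch below shows how a group of sampled responses to one problem could be turned into GRPO-style advantages, A_i = (r_i - mean(r)) / std(r). It is a simplified illustration: the +1/-1 base rewards are assumed, the -0.5 penalty is applied as a deduction (one possible reading of the ablation), the population standard deviation is an arbitrary choice, and the clipped policy-gradient term of the full GRPO objective is omitted. Function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def shaped_reward(correct: bool, code_failed: bool,
                  penalize_failures: bool = False) -> float:
    """Outcome reward, optionally reduced by 0.5 when code was not executable."""
    reward = 1.0 if correct else -1.0  # assumed base values
    if penalize_failures and code_failed:
        reward -= 0.5                  # ablation-only penalty
    return reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples for the same problem."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every sample in a group receives the same reward, all advantages are zero, so that problem contributes no gradient signal for the step.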

Key Hyperparameters:

rollout_batch_size: 128
samples_per_problem: 16
temperature: 1
kl_loss_coefficient: 0 (omitted)
max_tool_calls_C: 1 (default), tested up to 2
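
For reference, the listed hyperparameters can be gathered into one configuration object. This is a hypothetical container, not the veRL configuration schema; the field names mirror the list above, and `trajectories_per_step` assumes that `rollout_batch_size` counts problems rather than individual samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToRLConfig:
    """Hypothetical grouping of the hyperparameters reported in the summary."""
    rollout_batch_size: int = 128      # problems per rollout step (assumed meaning)
    samples_per_problem: int = 16      # responses sampled per problem
    temperature: float = 1.0           # sampling temperature during rollouts
    kl_loss_coefficient: float = 0.0   # KL loss omitted
    max_tool_calls: int = 1            # C; values up to 2 were tested

    def trajectories_per_step(self) -> int:
        # Assumes the batch size counts problems, each sampled several times.
        return self.rollout_batch_size * self.samples_per_problem
```

Under that assumption, the defaults give 128 × 16 = 2048 trajectories per rollout step, each of which may call the interpreter, which helps explain why tool calls dominate rollout cost.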

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. Qwen2.5-Math-Instruct-TIR: ToRL trains from base models via RL exploration, whereas TIR baselines use SFT on fixed data
vs. Traditional RL (CoT): ToRL integrates a code interpreter into the RL loop, enabling computational feedback
vs. DeepScaleR: ToRL focuses on tool integration (TIR) scaling rather than pure CoT scaling

Limitations

High computational overhead during training due to tool integration (rollout speed drops as tool call frequency rises)
Increasing max tool calls (C) improves performance but significantly reduces training speed
Explicit execution-based penalties (reward shaping) were found to degrade performance rather than help
Tested primarily on math benchmarks; generalization to other tool-use domains is not explored

Reproducibility

Code: https://github.com/GAIR-NLP/ToRL

Code, datasets, and models are open-sourced in the repository above. The paper specifies the veRL framework for training and Sandbox Fusion for code execution. Prompt templates are referenced in the paper (Figure 3).

📊 Experiments & Results

Evaluation Setup

Mathematical problem solving with access to a Python code interpreter.

Benchmarks:

AIME24 (Competition Math)
AIME25 (Competition Math)
MATH500 (Math Problems)
OlympiadBench (Olympiad Math)
AMC23 (Competition Math)

Metrics:

Accuracy (Pass@1 with greedy decoding)
Statistical methodology: Not explicitly reported in the paper
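
With greedy decoding there is one response per problem, so Pass@1 reduces to the fraction of problems answered correctly. A minimal sketch, assuming answers are compared as normalized strings (the function name and matching rule are illustrative, not taken from the paper):

```python
from typing import Sequence


def pass_at_1(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of problems whose single greedy answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```

If AIME24 is the usual 30-problem set, a score of 43.3% corresponds to 13 correct answers, so a single problem shifts the score by about 3.3 points.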

Key Results

Main results demonstrating ToRL's performance against SFT and pure-RL baselines across multiple model sizes.
Benchmark	Metric	Baseline	This Paper	Δ
Average (5 benchmarks)	Accuracy	41.3	48.5	+7.2
Average (5 benchmarks)	Accuracy	35.9	48.5	+12.6
AIME24	Accuracy	26.3	43.3	+17.0
Average Accuracy	Accuracy	46.5	48.5	+2.0
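
The Δ column is an absolute difference in accuracy, measured in percentage points. The short check below recomputes it from the Baseline and This Paper columns; the numbers are copied from the table and the variable names are illustrative.

```python
# (benchmark, baseline accuracy, ToRL accuracy, reported delta), from the table.
ROWS = [
    ("Average (5 benchmarks)", 41.3, 48.5, 7.2),
    ("Average (5 benchmarks)", 35.9, 48.5, 12.6),
    ("AIME24", 26.3, 43.3, 17.0),
    ("Average Accuracy", 46.5, 48.5, 2.0),
]


def absolute_gain(baseline: float, ours: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(ours - baseline, 1)


# Recompute the delta for every row of the table.
recomputed = [absolute_gain(base, ours) for _, base, ours, _ in ROWS]
```

All four recomputed values match the reported Δ column.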

Main Takeaways

RL from base models significantly outperforms SFT on distilled trajectories for tool-integrated reasoning.
Models autonomously learn to increase code usage frequency and correctness over training steps without explicit supervision.
Emergent behaviors include self-correction after error messages and strategic switching between code and text reasoning.
There is a trade-off between performance and training efficiency: allowing more tool calls (C=2) helps accuracy but slows down training.
Penalty-based rewards for execution errors are counterproductive; simple outcome-based rewards work best.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically GRPO)
Tool-Integrated Reasoning (TIR)
Large Language Models (Base vs. Instruct)

Key Terms

ToRL: Tool-Integrated Reinforcement Learning—the proposed framework for training base models to use tools via RL without SFT

TIR: Tool-Integrated Reasoning—interleaving natural language reasoning with executable code blocks to solve problems

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used here to optimize the model policy based on group scores

SFT: Supervised Fine-Tuning—training models on labeled examples; the paper contrasts ToRL against this traditional approach

Base Model: A pre-trained language model that has not undergone instruction tuning or RLHF

Sandbox Fusion: The specific isolated code execution environment used to run model-generated Python code safely

CoT: Chain-of-Thought—a reasoning technique where models generate intermediate steps; ToRL augments this with executable code

Pass Ratio: The proportion of responses that lead to a correct final answer

Metacognition: The model's ability to monitor and regulate its own cognitive processes, such as recognizing when code generation is ineffective