ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

📝 Paper Summary

Tool use, post-training, RL-based

ReTool enhances LLM mathematical reasoning by integrating a sandboxed code interpreter directly into the RL rollout process, allowing the model to learn outcome-driven strategies for when and how to execute code.

Core Problem

Reasoning models like DeepSeek R1 struggle with tasks requiring precise calculation or symbolic manipulation (e.g., geometric reasoning) because pure textual chain-of-thought lacks reliable verification and computational power.

Why it matters:

Text-based reasoning suffers from ambiguity and compounding errors in numerical steps
Supervised tool-use methods are limited to imitating curated patterns and often fail to generalize or adaptively decide when to invoke tools
Models need to autonomously discover optimal tool invocation patterns (e.g., self-correction) rather than relying on brittle human priors

Concrete Example: In a complex equation-solving task, a text-only model might hallucinate an intermediate calculation, which leads to a wrong final answer. ReTool instead writes a code block to solve the equation, executes it in a sandbox, and continues reasoning from the exact return value.

Key Novelty

ReTool (Tool-augmented Reinforcement Learning)

Integrates a code interpreter sandbox directly into the PPO rollout loop, allowing the policy to generate 'hybrid' traces containing text, code, and execution feedback
Utilizes a cold-start data pipeline that converts textual reasoning steps into code-augmented traces to initialize the model before RL
Optimizes tool-use strategy via outcome-based rewards, enabling emergent behaviors like code self-correction without explicit supervision on tool mechanics

Architecture

Comparison between Text-based Rollout and Tool-Integrated Rollout (ReTool). It illustrates how ReTool pauses generation to execute code and injects the output back into the context.

Evaluation Highlights

+27.0 percentage-point accuracy improvement on AIME 2024 (67.0% vs 40.0%) for Qwen2.5-32B-Instruct trained with ReTool compared to text-based RL
Surpasses OpenAI o1-preview by 27.9 percentage points on AIME in extended settings using ReTool-32B
Response length reduced by ~40% post-training, indicating higher efficiency when offloading computation to code

Breakthrough Assessment

8/10

Significant performance gains on hard math benchmarks (AIME) by successfully integrating code execution into the RL reasoning loop. Demonstrates emergent self-correction behavior.

⚙️ Technical Details

Problem Definition

Setting: Mathematical problem solving with access to an external code interpreter

Inputs: Natural language math problem q

Outputs: Final answer a (within a specified format, e.g., \boxed{})

Pipeline Flow

Policy LLM (generates reasoning text)
Code Parser (detects code blocks)
Code Sandbox (executes code)
Feedback Loop (returns execution output to LLM context)
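
The four stages above can be sketched as a single loop. This is a minimal illustration, not the authors' implementation: the `<code>` and `<interpreter>` tag names, the `scripted_policy` stand-in for the LLM, and the in-process `run_in_sandbox` helper are all assumptions made for the example, and `exec` provides no real isolation.

```python
import contextlib
import io
import re

# Assumed tag names; the summary only says code is "wrapped in specific tags".
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_in_sandbox(code: str) -> str:
    """Toy stand-in for a sandbox: run code in a fresh namespace, capture stdout.

    This is NOT isolated. A real system would use a locked-down process.
    """
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {"__name__": "__sandbox__"})
    except Exception as exc:  # return errors as feedback, like results
        return f"{type(exc).__name__}: {exc}"
    return buf.getvalue().strip()


def rollout(policy, question: str, max_turns: int = 4) -> str:
    """Alternate between generation and execution until no code is emitted."""
    context = question
    for _ in range(max_turns):
        segment = policy(context)            # Policy LLM
        context += segment
        match = CODE_RE.search(segment)      # Code Parser
        if match is None:
            break                            # final answer reached
        feedback = run_in_sandbox(match.group(1))            # Code Sandbox
        context += f"<interpreter>{feedback}</interpreter>"  # Feedback Loop
    return context


def scripted_policy(context: str) -> str:
    """Hypothetical policy: writes code first, then answers from the feedback."""
    if "<interpreter>" not in context:
        return "\nLet me compute this.<code>print(sum(range(1, 101)))</code>"
    result = re.findall(r"<interpreter>(.*?)</interpreter>", context, re.DOTALL)[-1]
    return f"\nThe answer is \\boxed{{{result}}}."


trace = rollout(scripted_policy, "What is 1 + 2 + ... + 100?")
print(trace)
```

Errors are returned as ordinary feedback rather than raised, so the policy can see a failure and react to it on the next turn.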

System Modules

Policy Model

Generates natural language reasoning and writes code blocks wrapped in specific tags

Model or implementation: Qwen2.5-32B-Instruct (or DeepSeek-R1-Distill-Qwen-32B)

Code Sandbox

Executes the parsed code snippet in a secure environment

Model or implementation: Asynchronous Python executor

Novel Architectural Elements

Interleaved Rollout Mechanism: Pause-and-resume generation during RL rollouts to inject real-time sandbox execution results
Asynchronous Sandbox Pool: Distributed worker pool for parallel code execution during RL training to prevent bottlenecks
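
One way to picture the asynchronous pool is a bounded set of worker slots that run snippets concurrently in separate interpreter processes. The sketch below assumes a single machine and uses only the standard library; `execute`, `execute_batch`, and the `max_workers` and `timeout` parameters are hypothetical names, and a plain subprocess is not a hardened sandbox.

```python
import asyncio
import sys


async def execute(code: str, slots: asyncio.Semaphore, timeout: float = 10.0) -> str:
    """Run one snippet in its own Python process, limited by the pool size."""
    async with slots:
        proc = await asyncio.create_subprocess_exec(
            sys.executable, "-c", code,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            out, err = await asyncio.wait_for(proc.communicate(), timeout)
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            return "TimeoutError"
        # Return stdout on success, the traceback on failure.
        return (out if proc.returncode == 0 else err).decode().strip()


async def execute_batch(snippets: list[str], max_workers: int = 4) -> list[str]:
    """Execute many snippets concurrently; results keep the input order."""
    slots = asyncio.Semaphore(max_workers)
    return await asyncio.gather(*(execute(s, slots) for s in snippets))


results = asyncio.run(
    execute_batch(["print(2 ** 10)", "print(sum(range(10)))", "1/0"])
)
print(results)
```

Because each rollout only awaits its own snippet, a slow or failing execution does not block the others, which is the bottleneck the pool is meant to avoid.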

Modeling

Base Model: Qwen2.5-32B-Instruct (also tested with DeepSeek-R1-Distill-Qwen-32B)

Training Method: PPO (Proximal Policy Optimization)

Objective Functions:

Purpose: Maximize expected reward while staying close to the reference model.

Formally: Standard PPO objective maximizing advantage A(q, o) with a KL penalty (the reported KL coefficient is 0.0, so the penalty term is inactive in this configuration).

Purpose: Assign reward based on final answer correctness.

Formally: r(a, \hat{a}) = 1 if verify(a, \hat{a}) else 0
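
The outcome reward can be illustrated with a small function. This is a sketch under assumptions: `extract_boxed` and the string and numeric comparison inside `verify` are hypothetical simplifications, and the actual answer checker used in the paper is not described in this summary.

```python
def extract_boxed(response: str):
    """Return the content of the last \\boxed{...} in the response, or None."""
    start = response.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(response):
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close of \boxed{
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # unbalanced braces


def verify(predicted, reference: str) -> bool:
    """Assumed equivalence check: exact match, else numeric match."""
    if predicted is None:
        return False
    p, r = predicted.replace(" ", ""), reference.replace(" ", "")
    if p == r:
        return True
    try:
        return abs(float(p) - float(r)) < 1e-9
    except ValueError:
        return False


def outcome_reward(response: str, reference: str) -> float:
    """Binary reward: 1 if the final boxed answer verifies, else 0."""
    return 1.0 if verify(extract_boxed(response), reference) else 0.0


print(outcome_reward("so the result is \\boxed{042}", "42"))
print(outcome_reward("I am not sure.", "42"))
```

The reward looks only at the final answer, so nothing in it scores whether or how code was used along the way.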

Adaptation: Full fine-tuning (implied by context of RL on 32B)

Training Data:

D_init: High-quality text-based reasoning data (filtered via human experts + DeepSeek-R1)
D_CI: Synthetic code-augmented traces created by transforming D_init via structured prompts and verifying execution
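
The "verifying execution" step for D_CI could look roughly like the filter below. The trace schema (`code` and `answer` keys), the helper names, and the rule of keeping a trace only when its code runs cleanly and prints the reference answer are all assumptions for illustration; the summary does not specify the actual verification criteria.

```python
import subprocess
import sys


def runs_cleanly(code: str, timeout: float = 5.0):
    """Execute a snippet in a subprocess; return (succeeded, stdout)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return proc.returncode == 0, proc.stdout.strip()


def build_d_ci(candidates):
    """Keep only candidate traces whose code executes and matches the answer."""
    kept = []
    for trace in candidates:
        ok, output = runs_cleanly(trace["code"])
        if ok and output == trace["answer"]:
            kept.append({**trace, "interpreter_output": output})
    return kept


candidates = [
    {"question": "3 * 7?", "code": "print(3 * 7)", "answer": "21"},   # valid
    {"question": "1 / 0?", "code": "print(1 / 0)", "answer": "0"},    # crashes
    {"question": "2 + 2?", "code": "print(2 + 3)", "answer": "4"},    # wrong
]
d_ci = build_d_ci(candidates)
print(len(d_ci))
```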

Key Hyperparameters:

learning_rate: 1e-6
optimizer: AdamW
batch_size: 512
kl_coefficient: 0.0
max_sequence_length: 16384 tokens
epochs: 2 (for cold-start phase)

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek R1: ReTool integrates external code execution tools directly into the RL rollout, whereas R1 relies on internal text-based chain-of-thought
vs. Supervised Tool-Use (e.g., Toolformer [not cited in paper]): ReTool uses RL with outcome rewards to discover strategies, rather than just imitating human/synthetic tool-use traces
vs. QwQ-32B-Preview: ReTool explicitly leverages code interpreters for calculation, achieving higher accuracy on math benchmarks

Limitations

Dependency on the quality of the initial cold-start dataset for tool usage competency
Outcome-based sparse reward signal (binary correctness) might be inefficient for learning complex tool interactions compared to process supervision
Overhead of invoking external sandbox during inference and training compared to pure text generation

Reproducibility

Code: https://retool-rl.github.io/

A project page is available (https://retool-rl.github.io/), but the paper describes the code and model weights as 'not yet released'. Training uses the publicly available VeRL framework.

📊 Experiments & Results

Evaluation Setup

Evaluation on challenging math competition problems requiring multi-step reasoning

Benchmarks:

AIME 2024 (Math Olympiad Problems)
AIME 2025 (Math Olympiad Problems)

Metrics:

Accuracy (Pass@1)
Statistical methodology: Each benchmark is evaluated 32 times, and the average accuracy across runs is reported as the pass@1 estimate
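
The averaging procedure can be written out in a few lines. The sketch assumes each run yields one correct/incorrect flag per problem; the function name and the toy data (4 runs of 3 problems rather than 32 runs of a full benchmark) are illustrative only.

```python
from statistics import mean


def estimate_pass_at_1(runs):
    """Average per-run accuracy; runs[i][j] is True if run i solved problem j."""
    per_run_accuracy = [mean(1.0 if ok else 0.0 for ok in run) for run in runs]
    return mean(per_run_accuracy)


# Toy data: 4 repeated runs over the same 3 problems.
runs = [
    [True, False, True],
    [True, True, True],
    [False, False, True],
    [True, False, True],
]
estimate = estimate_pass_at_1(runs)
print(f"estimated pass@1: {estimate:.3f}")
```

Averaging over repeated runs reduces the sampling noise of a single pass, which matters on a benchmark with few problems.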

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Main comparison on AIME 2024 showing ReTool significantly outperforming baselines with fewer training steps.
AIME 2024	Accuracy	40.0	67.0	+27.0
AIME 2024	Accuracy	26.7	67.0	+40.3
AIME 2024	Accuracy	56.7	67.0	+10.3
Results on AIME 2025 showing generalization and performance against proprietary models.
AIME 2025	Accuracy	36.7	49.3	+12.6
AIME 2025	Accuracy	37.9	49.3	+11.4
Cold-start performance showing the effectiveness of the data pipeline.
AIME 2024	Accuracy	26.7	40.9	+14.2
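
The Δ column is an absolute difference in percentage points, not a relative gain. The short check below recomputes it from the baseline and ReTool values in the table; the variable names are illustrative, and the relative gain is shown only to make the distinction concrete.

```python
# (benchmark, baseline accuracy %, ReTool accuracy %) as listed in the table.
rows = [
    ("AIME 2024", 40.0, 67.0),
    ("AIME 2024", 26.7, 67.0),
    ("AIME 2024", 56.7, 67.0),
    ("AIME 2025", 36.7, 49.3),
    ("AIME 2025", 37.9, 49.3),
    ("AIME 2024", 26.7, 40.9),
]

deltas = []
for benchmark, baseline, ours in rows:
    delta = round(ours - baseline, 1)          # absolute gain in points
    relative = (ours - baseline) / baseline    # relative gain, for contrast
    deltas.append(delta)
    print(f"{benchmark}: {baseline:.1f} -> {ours:.1f} "
          f"({delta:+.1f} points, {relative:+.0%} relative)")
```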

Experiment Figures

Accuracy vs Training Steps curve on AIME 2024 comparing ReTool against Text-based RL.

Main Takeaways

Tool-integrated RL is significantly more sample-efficient than text-based RL: ReTool reaches 67.0% on AIME 2024 in 400 training steps, while text-based RL reaches 40.0% after 1080 steps.
The model learns emergent behaviors such as self-correction (fixing code errors based on sandbox feedback) without explicit supervision.
Code usage patterns shift during training: earlier invocation of tools and higher complexity of code snippets.
Response length decreases by ~40% after RL, suggesting that offloading computation to code is more token-efficient than textual chain-of-thought.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO)
Chain-of-Thought (CoT) reasoning
Code Interpreters in LLMs
Supervised Fine-Tuning (SFT)

Key Terms

PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates a policy in small, stable steps

Rollout: The process where the model generates a sequence of actions (text or code) to solve a problem during RL training

Cold-start: Initial supervised training phase using curated data to give the model basic competency before RL

Sandboxed Code Interpreter: An isolated environment where code generated by the model is executed safely, returning results or errors

KV-Cache: Key-Value Cache—stored attention computations used to speed up generation; here, reused to optimize repeated rollouts

AIME: American Invitational Mathematics Examination—a challenging math competition benchmark

Pass@1: The fraction of problems for which the model's single sampled answer is correct; here it is estimated by averaging accuracy over repeated evaluation runs