ToRL: Tool-Integrated Reinforcement Learning—the proposed framework for training base models to use tools via RL without SFT
TIR: Tool-Integrated Reasoning—interleaving natural language reasoning with executable code blocks to solve problems
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that scores each sampled response relative to the other responses sampled for the same prompt; used here to optimize the model policy
SFT: Supervised Fine-Tuning—training models on labeled examples; the paper contrasts ToRL against this traditional approach
Base Model: A pre-trained language model that has not undergone instruction tuning or RLHF
Sandbox Fusion: The specific isolated code execution environment used to run model-generated Python code safely
CoT: Chain-of-Thought—a reasoning technique where models generate intermediate steps; ToRL augments this with executable code
Pass Ratio: The proportion of responses that lead to a correct final answer
Metacognition: The model's ability to monitor and regulate its own cognitive processes, such as recognizing when code generation is ineffective
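
To make the GRPO and Pass Ratio entries concrete, here is a minimal Python sketch. It assumes the commonly described GRPO normalization, in which each response's reward is centered on its group's mean and divided by the group's standard deviation. The function names, the +1/-1 reward values, the epsilon term, and the choice of population standard deviation are illustrative assumptions, not details taken from the glossary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each reward relative to the other rewards in its group.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    using the population standard deviation. eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_ratio(is_correct):
    """Proportion of responses that reach a correct final answer."""
    if not is_correct:
        raise ValueError("need at least one response")
    return sum(is_correct) / len(is_correct)


# Hypothetical group of four responses sampled for one prompt:
# +1.0 marks a correct final answer, -1.0 an incorrect one.
rewards = [1.0, -1.0, -1.0, 1.0]

advantages = group_relative_advantages(rewards)
ratio = pass_ratio([r > 0 for r in rewards])

print(advantages)  # correct responses get positive values, incorrect ones negative
print(ratio)
```

Because advantages are computed within a group, a correct answer earns a large positive signal only when other responses to the same prompt failed. If every response in the group gets the same reward, all advantages collapse to zero and that prompt contributes no learning signal.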