MSARL: Multi-Small-Agent Reinforcement Learning—the proposed framework decoupling reasoning and tool interpretation agents
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of samples for the same input to stabilize training
TIR: Tool-Integrated Reasoning—systems where LLMs utilize external tools (like code interpreters) to solve problems
SFT: Supervised Fine-Tuning—training models on labeled datasets before applying reinforcement learning
Cognitive Load: The mental effort required to process information; here, the interference that arises when a single model must manage high-level reasoning and low-level code syntax at the same time
Code Sandbox: An isolated environment for safely executing code generated by the model
Nucleus Sampling: A text generation method (top-p sampling) in which the next token is sampled from the smallest set of highest-probability tokens whose cumulative probability reaches at least p
OR: Outcome Reward—reward given only at the end of a task based on the final result
PRM: Process Reward Model—a model that evaluates the correctness of intermediate reasoning steps
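
Two of the entries above, GRPO and Nucleus Sampling, describe procedures rather than plain concepts, so a small illustration may help. The Python below is a minimal sketch, not the implementation used in the source: the function names, the eps stabilizer, and the use of the population standard deviation are assumptions made for illustration only.

```python
import math
import random
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize each reward against its own group.

    `rewards` holds one scalar reward per sample drawn for the same input.
    Assumption: population standard deviation plus a small `eps` to avoid
    division by zero when every sample receives the same reward.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def nucleus_candidates(probs: Sequence[float], p: float) -> List[int]:
    """Indices of the smallest set of top tokens whose cumulative mass is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    total = 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept


def nucleus_sample(probs: Sequence[float], p: float, rng: random.Random) -> int:
    """Sample one token index from the nucleus, weighted by the kept probabilities."""
    kept = nucleus_candidates(probs, p)
    weights = [probs[i] for i in kept]  # rng.choices renormalizes the weights
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Four samples for one prompt, scored with a binary outcome reward.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Toy next-token distribution over a four-token vocabulary.
    toy_probs = [0.5, 0.3, 0.15, 0.05]
    print(nucleus_candidates(toy_probs, p=0.9))
    print(nucleus_sample(toy_probs, p=0.9, rng=random.Random(0)))
```

In this sketch, samples that score above their group's mean receive positive advantages and those below receive negative ones; if every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.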