Reducing Cognitive Overhead in Tool Use via Multi-Small-Agent Reinforcement Learning

📝 Paper Summary

Multi-agent tool use, Reinforcement Learning for Reasoning

MSARL decouples mathematical reasoning from tool execution using two specialized agents—a Reasoner and a Helper—trained via collaborative reinforcement learning to reduce cognitive interference and improve accuracy.

Core Problem

Single-agent systems combining high-level reasoning with low-level tool execution suffer from cognitive load interference, where managing code generation/interpretation degrades the quality of the reasoning itself.

Why it matters:

Coupling complex logic with precise tool syntax forces models to juggle competing cognitive demands, leading to unstable outputs
Current approaches relying on single-agent SFT (Supervised Fine-Tuning) fail to optimize the interaction between reasoning and tool use
Intermediate logical steps are often degraded when models attempt to interleave code execution, as evidenced by performance drops in medium-difficulty math problems

Concrete Example: In a pilot study on MATH-500, a single agent instructed to use code ('r_code') performed worse than a reasoning-only agent ('r_only') on Medium-Easy problems (-0.18 accuracy gap), showing that the burden of tool management disrupted the problem-solving logic.

Key Novelty

Multi-Small-Agent Reinforcement Learning (MSARL)

Explicitly decouples roles into a 'Reasoner' (planning, logic) and a 'Helper' (tool output interpretation/condensing) to prevent cognitive overload
Uses a collaboration-oriented reward mechanism where the 'Helper' is rewarded based on whether the 'Reasoner' successfully solves the problem using the Helper's output
Employs a hierarchical RL framework (based on GRPO) to jointly optimize both agents, treating the Helper's interpretation as a critical milestone
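The collaboration-oriented reward can be illustrated with a minimal sketch. It assumes a binary outcome reward based on exact string match with the ground truth; the function names and the matching rule are hypothetical and are not taken from the paper's code.

```python
def outcome_reward(predicted: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches, else 0.0.

    Exact string match after stripping whitespace is an assumption made
    for this sketch; a real checker would likely normalize math answers.
    """
    return 1.0 if predicted.strip() == ground_truth.strip() else 0.0


def helper_rewards(final_answers: list[str], ground_truth: str) -> list[float]:
    """Reward each Helper interpretation by the Reasoner's downstream result.

    final_answers[j] is the answer the Reasoner reached after reading the
    j-th Helper interpretation, so the Helper is only credited when its
    output actually helped the Reasoner solve the problem.
    """
    return [outcome_reward(answer, ground_truth) for answer in final_answers]
```

The point of this design is that the Helper has no reward of its own: a fluent but unhelpful interpretation earns nothing if the Reasoner still fails.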

Architecture

The MSARL framework illustrating the interaction between the Reasoner and Helper agents.

Evaluation Highlights

Pilot study reveals a significant accuracy drop of 0.18 (18 percentage points) on Medium-Easy MATH-500 problems when forcing a single agent to use code versus reasoning alone
Accuracy gaps ranging from 0.02 to 0.18 observed across all difficulty levels for single-agent tool integration, confirming the 'cognitive interference' hypothesis

Breakthrough Assessment

7/10

Identifies and quantifies 'cognitive interference' in single-agent tool use and proposes a novel multi-agent RL solution. (Score limited as final performance tables are not in the provided text snippet).

⚙️ Technical Details

Problem Definition

Setting: Mathematical problem solving via code execution

Inputs: Natural language question q

Outputs: Final answer O (derived through interleaved reasoning and tool use)

Pipeline Flow

Reasoner (Decomposition & Tool Call)
Code Sandbox (Execution)
Helper (Interpretation)
Reasoner (Conclusion)
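The four pipeline stages above can be sketched as a simple control loop. This is an illustration under stated assumptions, not the authors' implementation: `run_episode`, the step dictionary format, the `[tool result]` marker, and the `toy_*` stand-ins are all hypothetical, and the toy sandbox is a lookup table rather than a real isolated interpreter.

```python
from typing import Callable

Step = dict  # hypothetical format: {"type": "code" | "answer", "content": str}


def run_episode(
    question: str,
    reasoner: Callable[[str, bool], Step],
    sandbox: Callable[[str], str],
    helper: Callable[[str, str, str], str],
    max_tool_calls: int,
) -> str:
    """Reasoner -> Code Sandbox -> Helper -> Reasoner, until an answer appears."""
    context = question
    calls = 0
    while True:
        allow_tools = calls < max_tool_calls  # threshold C from the summary
        step = reasoner(context, allow_tools)
        if step["type"] == "answer":
            return step["content"]
        if not allow_tools:
            raise ValueError("reasoner issued a tool call beyond the limit")
        raw_output = sandbox(step["content"])  # execution
        summary = helper(question, step["content"], raw_output)  # interpretation
        context += "\n[tool result] " + summary  # fed back to the Reasoner
        calls += 1


# Toy stand-ins so the loop can be exercised without any model.
def toy_reasoner(context: str, allow_tools: bool) -> Step:
    if "[tool result]" in context:
        value = context.rsplit("printed ", 1)[1].rstrip(".")
        return {"type": "answer", "content": value}
    if allow_tools:
        return {"type": "code", "content": "print(17 * 24)"}
    return {"type": "answer", "content": "unknown"}


def toy_sandbox(code: str) -> str:
    # Stand-in for a real isolated interpreter; it only knows one snippet.
    return {"print(17 * 24)": "408\n"}.get(code, "error: unknown snippet")


def toy_helper(question: str, code: str, raw_output: str) -> str:
    return f"The code printed {raw_output.strip()}."
```

Calling `run_episode("What is 17 * 24?", toy_reasoner, toy_sandbox, toy_helper, max_tool_calls=2)` walks through one tool call and one Helper interpretation before the Reasoner concludes.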

System Modules

Reasoner

Decomposes problems, generates reasoning traces, and issues tool calls

Model or implementation: LLM (policy π_reason)

Code Sandbox

Executes the code generated by the Reasoner

Model or implementation: Python Interpreter (Non-AI tool)

Helper (Tool Agent)

Processes raw tool output, condenses it, and generates a structured natural language interpretation

Model or implementation: LLM (policy π_tool)

Novel Architectural Elements

Explicit separation of the 'Helper' agent solely for interpreting/compressing tool outputs before returning control to the Reasoner
Feedback loop where the Helper's interpretation is fed back to the Reasoner to continue the trajectory

Modeling

Base Models: Qwen2.5-3B-Instruct, Qwen3-4B, Qwen2.5-Math-1.5B-Instruct (used in pilot/experiments)

Training Method: Hierarchical Reinforcement Learning (adapted GRPO)

Objective Functions:

Purpose: Optimize the Tool/Helper policy to generate interpretations that lead to correct final answers.

Formally: (ratio of new to old policy probabilities) × (normalized advantage, calculated from final-answer correctness)

Purpose: Optimize the Reasoner policy to generate trajectories likely to succeed given the Helper's input.

Formally: (ratio of new to old policy probabilities) × (aggregated advantage, averaged over multiple interpretations)
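The two advantage terms can be made concrete with a small sketch. The summary does not give the exact normalization scope, so this is one plausible reading and an assumption: Helper advantages are GRPO-style normalized within the n interpretations of one trajectory, while each Reasoner trajectory is scored by its mean reward over interpretations and then normalized across the m trajectories. All function names and the epsilon constant are hypothetical.

```python
from statistics import mean, pstdev

EPS = 1e-6  # avoids division by zero when every reward in a group is equal


def group_normalize(values: list[float]) -> list[float]:
    """GRPO-style normalization: subtract the group mean, divide by group std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def helper_advantages(rewards: list[list[float]]) -> list[list[float]]:
    """rewards[i][j] is the outcome reward for trajectory i, interpretation j.

    Each row is normalized on its own, so interpretations compete only
    against other interpretations of the same tool output.
    """
    return [group_normalize(row) for row in rewards]


def reasoner_advantages(rewards: list[list[float]]) -> list[float]:
    """Aggregate over interpretations first, then normalize across trajectories.

    This ordering is an assumption of the sketch, not a detail confirmed
    by the summary.
    """
    per_trajectory = [mean(row) for row in rewards]
    return group_normalize(per_trajectory)
```

With rewards `[[1.0, 0.0], [1.0, 1.0]]`, the first trajectory's two interpretations receive opposite-signed Helper advantages, the second trajectory's receive zero, and the second trajectory gets the higher Reasoner advantage.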

Key Hyperparameters:

C (Max Tool Calls): no numeric value is given; it is a threshold beyond which the Reasoner is forced to continue with text-only reasoning
sampling_n (Tool Interpretations): n (variable in formulation)
sampling_m (Reasoning Trajectories): m (variable in formulation)
nucleus_sampling_p: 0.95
temperature: 0.7
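To show how the two sampling hyperparameters interact, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is illustrative only; real inference libraries implement this on tensors, and the function names here are hypothetical.

```python
import math
import random


def top_p_filter(logits: list[float], temperature: float = 0.7,
                 top_p: float = 0.95) -> dict[int, float]:
    """Return the renormalized distribution over the nucleus of token ids."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits: list[float], temperature: float = 0.7,
                 top_p: float = 0.95, rng=random) -> int:
    """Draw one token id from the filtered, renormalized distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

A temperature below 1 sharpens the distribution before filtering, so with 0.7 and p = 0.95 the nucleus tends to be smaller than it would be at temperature 1.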

Compute: Not reported in the paper

Comparison to Prior Work

vs. Integrated Agent: MSARL assigns tool-output interpretation to a separate 'Helper' agent, whereas an integrated agent handles both reasoning and interpretation (an implied baseline; the paper does not cite a specific system)
vs. DeepSeek-R1: MSARL uses tool-augmented RL with role separation, while R1 focuses on pure reasoning chain optimization

Limitations

Agent-to-agent communication during rollout introduces significant GPU idle time
Requires defining a maximum tool call threshold (C) to prevent infinite loops
Relies on ground truth for the binary reward signal (Eq 1), which may limit applicability to open-ended tasks without clear answers

Reproducibility

Code: https://github.com/dayuwang401/MSARL-

The code is publicly available at the repository above, and the paper includes prompt templates (Figure 3). Training compute resources and duration are not reported in the provided text.

📊 Experiments & Results

Evaluation Setup

Mathematical problem solving with optional code execution support

Benchmarks:

MATH-500 (Mathematical Problem Solving)

Metrics:

Accuracy (binary success compared to ground truth)
Pass@N
Statistical methodology: Not explicitly reported in the paper
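The summary lists Pass@N without defining how it is computed. As a reference point only, the sketch below implements the commonly used unbiased pass@k estimator, which computes the probability that at least one of k samples drawn without replacement from n generations is correct, given that c of the n are correct. Whether the paper uses this estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples generated for the problem
    c: number of those samples that are correct
    k: evaluation budget (the N in Pass@N)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 1 is correct, `pass_at_k(4, 1, 1)` gives 0.25, matching plain accuracy when k = 1.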

Key Results

Pilot study results demonstrating the 'cognitive overhead' problem, comparing a single model prompted for Reasoning-Only (r_only) vs. Reasoning-with-Code (r_code).

Benchmark	Metric	Baseline	This Paper	Δ
MATH-500 (Medium-Easy)	Accuracy Gap (r_code - r_only)	0.00	-0.18	-0.18
MATH-500 (Medium-Hard)	Accuracy Gap (r_code - r_only)	0.00	-0.08	-0.08

Experiment Figures

Bar chart comparing 'Reasoning-only' vs 'Reasoning-with-code' accuracy across difficulty levels (Easy to Hard).

Main Takeaways

Single-agent systems exhibit a 'cognitive load interference' where interleaving tool use with reasoning degrades the quality of the reasoning steps themselves.
The performance degradation from tool integration is most pronounced on medium-difficulty problems, where the model struggles to balance problem formulation with execution mechanics.
MSARL's decoupled architecture is designed to address this by offloading the 'interpretation' burden to a Helper agent, allowing the Reasoner to focus on strategy.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals
Large Language Models (LLMs) and Tool Use
Proximal Policy Optimization (PPO) concepts

Key Terms

MSARL: Multi-Small-Agent Reinforcement Learning—the proposed framework decoupling reasoning and tool interpretation agents

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of samples for the same input to stabilize training

TIR: Tool-Integrated Reasoning—systems where LLMs utilize external tools (like code interpreters) to solve problems

SFT: Supervised Fine-Tuning—training models on labeled datasets before applying reinforcement learning

Cognitive Load: The mental effort required to process information; here, the conflict between high-level logic and low-level code syntax management

Code Sandbox: An isolated environment for safely executing code generated by the model

Nucleus Sampling: A text generation method (Top-p) where the next token is chosen from the smallest set of top tokens whose cumulative probability exceeds p

OR: Outcome Reward—reward given only at the end of a task based on the final result

PRM: Process Reward Model—a model that evaluates the correctness of intermediate reasoning steps