Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

📝 Paper Summary

Agentic Reinforcement Learning (ARL)
Post-training optimization
Tool-use and Reasoning integration

The paper shows empirically that jointly training reasoning and tool-use in agents produces conflicting gradient updates, and proposes DART, which disentangles the two capabilities via separate LoRA modules.

Core Problem

Most Agentic RL methods jointly optimize reasoning and tool-use on shared parameters, assuming they are compatible, but this actually causes performance degradation due to interference.

Why it matters:

Improving tool-use often degrades reasoning (and vice versa) in a 'seesaw' phenomenon, limiting overall agent performance.
Near-orthogonal gradient directions between reasoning and tool-use tokens push the shared parameters toward a 'compromise' update that is suboptimal for both.
Current paradigms rely on implicit assumptions of compatibility that have not been rigorously tested until now.

Concrete Example: When an agent attempts to solve a complex question requiring both logic (reasoning) and external data (tool-use), joint training might improve its ability to call the search API but simultaneously degrade its ability to synthesize the search results into a coherent answer, because the gradients for these two token types push the shared weights in conflicting directions.
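The geometry behind the 'compromise' direction can be illustrated with a toy calculation. The sketch below uses two hypothetical 2-D gradient vectors chosen to be exactly orthogonal; the values are illustrative and do not come from the paper. It shows only that the summed (joint) update is aligned with neither per-role gradient, not that performance degrades.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical per-role gradients (assumption: exactly orthogonal, 2-D).
g_reason = [1.0, 0.0]
g_tool = [0.0, 1.0]

# With shared parameters, the update follows the sum of both gradients.
g_joint = [a + b for a, b in zip(g_reason, g_tool)]

cos_between = cosine(g_reason, g_tool)        # alignment between the two roles
cos_joint_reason = cosine(g_joint, g_reason)  # alignment of joint step with reasoning
cos_joint_tool = cosine(g_joint, g_tool)      # alignment of joint step with tool-use

angle_joint_reason = math.degrees(math.acos(cos_joint_reason))
print(cos_between, cos_joint_reason, angle_joint_reason)
```

In this toy case the joint step sits 45 degrees away from each per-role gradient, so neither capability receives an update along its own preferred direction.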

Key Novelty

Disentangled Action-Reasoning Tuning (DART)

Routes reasoning tokens and tool-use tokens to separate, disjoint LoRA adapters while keeping the pre-trained backbone frozen.
Prevents gradient conflict by ensuring that updates for reasoning logic do not overwrite or interfere with updates for tool execution patterns.
Introduces a diagnostic framework (LEAS) inspired by variance-based statistics to mathematically quantify the interference between capabilities.

Architecture

Conceptual flow of DART vs Joint Training. Joint training updates one shared parameter set. DART routes tokens to specific 'Reasoning LoRA' or 'Tool LoRA' based on token type.

Evaluation Highlights

DART achieves an average Exact Match (EM) score improvement of +6.35% over joint-training baselines across seven benchmarks.
Matches the performance of specialized multi-agent systems (which use separate models) while using only a single model with lightweight adapters.
Empirical analysis using LEAS confirms a negative interaction (interference) between capabilities in the majority of test cases.

Breakthrough Assessment

7/10

Strong empirical evidence of a fundamental problem (gradient conflict) in standard Agentic RL, offering a simple but effective architectural solution (DART) that yields consistent gains.

⚙️ Technical Details

Problem Definition

Setting: Agentic Reinforcement Learning (ARL) where an agent generates a trajectory interleaving reasoning thoughts and tool actions.

Inputs: A query q requiring complex reasoning and external tool interaction.

Outputs: A trajectory containing reasoning tokens and tool-use tokens, ending in a final answer.

Pipeline Flow

Input Query -> Token Role Identification -> Router -> Specific LoRA Module -> Output Token

System Modules

Role-based Router

Identifies if the current token generation step corresponds to 'reasoning' or 'tool-use' based on trajectory tags.

Model or implementation: Rule-based function l(t)
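A rule-based role function of this kind could look like the following minimal sketch. The tag names `<tool_call>` and `</tool_call>` and the function name `label_roles` are assumptions for illustration; the summary does not specify which trajectory tags the paper uses.

```python
REASONING, TOOL = "reasoning", "tool"

# Hypothetical tag names; the actual tags are not given in this summary.
TOOL_OPEN, TOOL_CLOSE = "<tool_call>", "</tool_call>"

def label_roles(tokens):
    """Return one role per token: TOOL inside a tool-call span
    (tags included), REASONING everywhere else."""
    roles = []
    inside_tool = False
    for tok in tokens:
        if tok == TOOL_OPEN:
            inside_tool = True
        roles.append(TOOL if inside_tool else REASONING)
        if tok == TOOL_CLOSE:
            inside_tool = False
    return roles

tokens = ["I", "need", "data", "<tool_call>", "search(q)", "</tool_call>", "so", "answer"]
roles = label_roles(tokens)
print(list(zip(tokens, roles)))
```

Because the rule depends only on tags already present in the trajectory, it needs no learned parameters, which matches the "hard routing" character described for DART.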

Reasoning Adapter (Generation)

Processes tokens identified as reasoning steps.

Model or implementation: LoRA Adapter (specialized for reasoning)

Tool-use Adapter (Generation)

Processes tokens identified as tool executions.

Model or implementation: LoRA Adapter (specialized for tools)

Novel Architectural Elements

Token-level routing to disjoint LoRA adapters within a single generation pass, specifically splitting 'reasoning' vs 'tool-use' capabilities.
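As a rough illustration of token-level routing to disjoint adapters, the sketch below applies a frozen weight plus one of two low-rank updates, selected by the token's role. The class name `RoutedLoRALinear`, the tiny dimensions, and all numeric values are hypothetical, and LoRA scaling factors are omitted; this is not the paper's implementation.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

class RoutedLoRALinear:
    """Frozen linear layer with one LoRA adapter per token role."""

    def __init__(self, W, adapters):
        self.W = W                # frozen backbone weight, d_out x d_in
        self.adapters = adapters  # role -> (A: r x d_in, B: d_out x r)

    def forward(self, h, role):
        A, B = self.adapters[role]
        # Output = W h + B (A h); only the selected adapter contributes.
        return add(matvec(self.W, h), matvec(B, matvec(A, h)))

# Toy setup (assumption): d = 2, rank r = 1, identity backbone.
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "reasoning": ([[1.0, 0.0]], [[0.5], [0.0]]),
    "tool": ([[0.0, 1.0]], [[0.0], [0.5]]),
}
layer = RoutedLoRALinear(W, adapters)

h = [2.0, 4.0]
out_reason = layer.forward(h, "reasoning")
out_tool = layer.forward(h, "tool")
print(out_reason, out_tool)
```

The same hidden state yields different outputs depending on the role, while the backbone weight `W` is shared and never modified.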

Modeling

Base Model: Pretrained LLM (Specific backbone not named in text, but methodology is general)

Training Method: Agentic Reinforcement Learning with Policy Gradient

Objective Functions:

Purpose: Maximize expected reward of the trajectory.

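Formally: $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, with the gradient estimated via policy gradient as $\nabla_\theta J(\theta) \approx \sum_t A(\tau)\, \nabla_\theta \log \pi_\theta(c_t \mid c_{<t})$.

Purpose: Selectively update parameters based on token role.

Formally: A gradient mask $m_t$ is applied to the per-token loss, where $m_t$ is active only for tokens whose role matches the adapter being updated.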
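The role-masked objective can be sketched as a per-adapter surrogate loss. The function name `masked_pg_loss`, the binary form of the mask, and the numbers are assumptions for illustration; a single trajectory-level advantage is used, mirroring the A(tau) in the formula above.

```python
def masked_pg_loss(log_probs, roles, advantage, target_role):
    """Policy-gradient surrogate loss restricted to tokens of one role.

    log_probs:   log pi(c_t | c_<t) for each token in the trajectory
    roles:       role label of each token
    advantage:   trajectory-level advantage A(tau)
    target_role: role whose adapter this loss is meant to update
    """
    # Assumed binary mask m_t: 1 for matching tokens, 0 otherwise.
    mask = [1.0 if r == target_role else 0.0 for r in roles]
    return -advantage * sum(m * lp for m, lp in zip(mask, log_probs))

# Hypothetical trajectory of four tokens.
log_probs = [-0.5, -1.0, -2.0, -0.25]
roles = ["reasoning", "tool", "tool", "reasoning"]
advantage = 2.0

loss_reason = masked_pg_loss(log_probs, roles, advantage, "reasoning")
loss_tool = masked_pg_loss(log_probs, roles, advantage, "tool")
loss_joint = -advantage * sum(log_probs)  # unmasked joint-training loss
print(loss_reason, loss_tool, loss_joint)
```

The two masked losses sum to the unmasked joint loss, so the split changes which parameters receive each token's gradient rather than the total training signal.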

Adaptation: Disentangled LoRA (separate adapters for reasoning and tools)

Trainable Parameters: LoRA matrices A and B (rank r << d)
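To see why rank r << d keeps the adapters lightweight, the following arithmetic compares parameter counts for a single square weight matrix. The values d = 4096 and r = 16 are hypothetical and are not taken from the paper.

```python
d = 4096  # hidden size (assumption)
r = 16    # LoRA rank (assumption)

full_params = d * d                     # one dense d x d weight matrix
lora_params_per_adapter = 2 * d * r     # A is r x d, B is d x r
dart_params = 2 * lora_params_per_adapter  # reasoning adapter + tool adapter

ratio = dart_params / full_params
print(full_params, lora_params_per_adapter, dart_params, ratio)
```

Under these assumed sizes, the two adapters together add about 1.6% of the parameters of the frozen matrix they modify.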

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard ARL: DART separates parameters to avoid gradient conflict.
vs. Multi-Agent Systems: DART uses a single backbone with lightweight adapters, avoiding the cost/complexity of maintaining multiple full models.
vs. Router-driven LoRA (e.g., Mixture-of-LoRAs): DART uses hard, role-based routing specifically for the reasoning/tool dichotomy rather than learned soft routing [not cited in paper].

Limitations

Relies on the ability to clearly distinguish reasoning tokens from tool-use tokens (requires accurate routing function l(t)).
Does not address interference between different types of tools or different types of reasoning, only the high-level split.
Requires maintaining two sets of LoRA parameters during inference (though lightweight).

Reproducibility

Code availability is not provided in the paper text. Detailed LEAS implementation logic is described mathematically. Hyperparameters for specific experiments are referenced as being in Appendix D (not provided in this excerpt).

📊 Experiments & Results

Evaluation Setup

Tool-augmented Question Answering

Benchmarks:

NQ (Natural Questions) (Open-domain QA)
HotpotQA (Multi-hop QA)
Five other unspecified benchmarks (Tool-augmented QA)

Metrics:

Exact Match (EM) score
Interaction Coefficient (Lambda) from LEAS
Statistical methodology: Linear Effect Attribution System (LEAS) used to quantify interaction effects.
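The exact LEAS formulation and its coefficient indexing are not reproduced in this summary. As a generic illustration of what an interaction coefficient measures, the sketch below computes the standard two-factor interaction contrast from four hypothetical accuracy values; a negative result means the combined effect falls short of the sum of the individual effects.

```python
# Hypothetical accuracies keyed by (reasoning_trained, tool_trained).
perf = {
    (0, 0): 0.20,  # neither capability trained
    (1, 0): 0.35,  # reasoning only
    (0, 1): 0.40,  # tool-use only
    (1, 1): 0.50,  # both trained jointly
}

# Individual effects relative to the untrained baseline.
effect_reason = perf[(1, 0)] - perf[(0, 0)]
effect_tool = perf[(0, 1)] - perf[(0, 0)]

# What a purely additive (no-interaction) model would predict for (1, 1).
additive_prediction = perf[(0, 0)] + effect_reason + effect_tool

# Interaction term: observed joint performance minus the additive prediction.
interaction = perf[(1, 1)] - additive_prediction
print(effect_reason, effect_tool, additive_prediction, interaction)
```

With these illustrative numbers the interaction is about -0.05, which would fall in the interference region (negative coefficient) described in the summary.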

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Average across 7 benchmarks	EM Score Improvement	Not reported in the paper	Not reported in the paper	+6.35%
NQ & HotpotQA (Analysis)	Interaction Coefficient Sign (Lambda_23)	0	< 0 (Negative)	Negative

Experiment Figures

Distribution of interaction coefficients (Lambda) vs correctness. Shows that most data points fall into the interference region (Lambda < 0).

Angle between gradients of different token types.

Main Takeaways

Joint optimization of reasoning and tool-use induces a 'seesaw' phenomenon where improving one hurts the other.
Gradients for reasoning and tool-use are nearly orthogonal, leading to conflicting updates in shared parameters.
DART successfully mitigates this interference by structurally separating the updates, yielding consistent performance gains.
Tasks with higher accuracy requirements tend to show stronger interference, suggesting that difficult tasks require distinct, non-conflicting internal representations.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradient)
Low-Rank Adaptation (LoRA)
Large Language Models (LLMs)
Gradient Descent dynamics

Key Terms

ARL: Agentic Reinforcement Learning—training LLMs to use tools and reason via reinforcement learning signals.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small rank-decomposition matrices.

LEAS: Linear Effect Attribution System—a diagnostic framework proposed in this paper to decompose agent performance into individual capability effects and interaction terms.

Gradient Conflict: When the gradient vectors for two different tasks (e.g., reasoning vs. tool use) point in different, often orthogonal, directions, making joint optimization difficult.

Seesaw Phenomenon: A situation where improving one metric or capability causes a simultaneous decline in another.

Exact Match (EM): A strict evaluation metric where the generated answer must exactly match the ground truth.

Token Routing: The process of directing specific tokens (based on their role) to specific network modules or adapters.
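The Exact Match metric defined above is commonly computed after light answer normalization. The sketch below follows the widely used SQuAD-style convention (lowercasing, removing punctuation and English articles, collapsing whitespace); whether the paper applies exactly this normalization is an assumption.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """Return 1.0 if the prediction matches any gold answer after
    normalization, else 0.0."""
    pred = normalize(prediction)
    return float(any(pred == normalize(g) for g in gold_answers))

em_hit = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
em_miss = exact_match("Paris, France", ["Paris"])
print(em_hit, em_miss)
```

The second case scores zero even though the prediction contains the gold answer, which shows how strict the metric is.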