Semantic Context for Tool Orchestration

📝 Paper Summary

Multi-call tool use with flexible plan · RL-based Agentic RAG pipeline

Providing agents with semantic descriptions of tools (Semantic Context) rather than opaque indices enables faster learning, better generalization, and robust adaptation to changing toolsets.

Core Problem

Naive tool orchestration treats tools as abstract indices in a large discrete action space, leading to inefficient learning and catastrophic forgetting when the toolset changes.

Why it matters:

Modern agents face dynamic environments where APIs are frequently added or removed, causing index-based policies to fail
Standard reinforcement learning approaches scale poorly with large vocabulary sizes (action spaces), requiring impractical amounts of interaction data
Treating actions as opaque IDs discards valuable prior knowledge contained in API documentation and docstrings

Concrete Example: When a 'Data Analyzer' tool is removed and replaced by a semantically similar 'Stats Calculator', an index-based agent must relearn the new tool's utility from scratch. A semantic agent sees the similar description and immediately generalizes its previous experience.

Key Novelty

Semantic Context (SC) for Action Representation

Represent agent actions (tools) not as one-hot vectors, but as dense embeddings derived from their natural language descriptions (Semantic Context)
Use a shared linear reward model over these semantic features, allowing the agent to predict the utility of unseen or new tools based on their similarity to known ones
Implement a 'Filter-Reason-Act' (FiReAct) pipeline that uses semantic similarity to retrieve a small candidate set of tools before reasoning, scaling to thousands of actions
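
The contrast between opaque indices and semantic features can be illustrated with a toy sketch. The three-dimensional vectors below are hand-made, hypothetical values standing in for embeddings of tool descriptions (they are not outputs of any real embedding model), and the weight vector `w` is likewise invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Index-based view: every tool is an orthogonal one-hot vector.
one_hot = {
    "data_analyzer": [1.0, 0.0, 0.0],
    "stats_calculator": [0.0, 1.0, 0.0],
    "image_resizer": [0.0, 0.0, 1.0],
}

# Semantic view: hypothetical toy embeddings of each tool's description.
semantic = {
    "data_analyzer": [0.9, 0.4, 0.1],
    "stats_calculator": [0.8, 0.5, 0.2],
    "image_resizer": [0.1, 0.2, 0.95],
}

# One-hot vectors share nothing, so experience cannot transfer between tools.
onehot_sim = cosine(one_hot["data_analyzer"], one_hot["stats_calculator"])

# Semantic vectors place similar tools close together.
semantic_sim = cosine(semantic["data_analyzer"], semantic["stats_calculator"])
unrelated_sim = cosine(semantic["data_analyzer"], semantic["image_resizer"])

# A shared linear reward model scores any tool from its features, so a tool
# that was never tried still receives an estimate.
w = [1.0, 0.5, -0.5]  # hypothetical learned weights
score_new = sum(wi * xi for wi, xi in zip(w, semantic["stats_calculator"]))
```

Under these made-up values, the one-hot similarity is exactly zero while the two analysis tools are nearly parallel in the semantic space, which is the property that lets a shared model generalize.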

Architecture

Pseudocode for the FiReAct (Filter-Reason-Act) pipeline.

Evaluation Highlights

SC-LinUCB maintains low, near-optimal cumulative regret (~100) in static settings, while index-based LinUCB incurs regret above 1000, at least an order of magnitude higher
In dynamic environments where tools are added and removed, SC-LinUCB shows no performance drop, whereas baselines suffer catastrophic forgetting and large regret spikes
FiReAct pipeline with semantic context achieves ~90% accuracy on a 10,000+ tool benchmark, compared to ~75% for retrieval alone

Breakthrough Assessment

8/10

Provides a rigorous theoretical and empirical foundation for a widely used but under-analyzed practice (using tool descriptions). The connection between contextual bandits and LLM in-context learning is particularly insightful.

⚙️ Technical Details

Problem Definition

Setting: Contextual Bandit problem with a dynamic action space (Lifelong Semantic Context MDP)

Inputs: User query q_t and a set of available tools A_t = {a_1, ..., a_O} with descriptions D(a)

Outputs: Selected tool a_t to execute

Pipeline Flow

Input Query
Semantic Filtering (Retrieval of top-k candidates)
Reasoning (LLM selection using Semantic Context)
Action Execution
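
The four stages above can be sketched as a small function chain. This is a minimal illustration under stated assumptions, not the paper's implementation: Jaccard word overlap stands in for embedding similarity, `toy_reasoner` is a stub where an LLM call would go, and the catalogue, executors, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

def tokenize(text: str) -> set:
    return set(text.lower().split())

def filter_tools(query: str, catalogue: Dict[str, str], k: int) -> List[str]:
    """Filter: keep the k tools whose descriptions best match the query.
    Word overlap is a toy stand-in for embedding similarity."""
    q = tokenize(query)
    def score(name: str) -> float:
        d = tokenize(catalogue[name])
        return len(q & d) / len(q | d)
    return sorted(catalogue, key=score, reverse=True)[:k]

def build_prompt(query: str, candidates: List[str], catalogue: Dict[str, str]) -> str:
    """Put the query and the candidates' names and descriptions in one prompt."""
    lines = [f"- {name}: {catalogue[name]}" for name in candidates]
    return f"Query: {query}\nTools:\n" + "\n".join(lines) + "\nAnswer with one tool name."

def fireact(query: str, catalogue: Dict[str, str],
            reasoner: Callable[[str, List[str]], str],
            executors: Dict[str, Callable[[str], str]], k: int = 3) -> Tuple[str, str]:
    candidates = filter_tools(query, catalogue, k)        # Filter
    prompt = build_prompt(query, candidates, catalogue)
    choice = reasoner(prompt, candidates)                 # Reason
    if choice not in candidates:                          # guard against invalid answers
        choice = candidates[0]
    return choice, executors[choice](query)               # Act

def toy_reasoner(prompt: str, candidates: List[str]) -> str:
    """Stub for an LLM call: a keyword rule applied to the query line."""
    query_line = prompt.splitlines()[0]
    if "mean" in query_line and "stats_calculator" in candidates:
        return "stats_calculator"
    return candidates[0]

CATALOGUE = {
    "weather_lookup": "get the current weather forecast for a city",
    "stats_calculator": "compute the mean and standard deviation of a list of numbers",
    "unit_converter": "convert numbers between units of length and mass",
    "image_resizer": "resize an image to a given width and height",
}
EXECUTORS = {name: (lambda q, n=name: f"{n} executed") for name in CATALOGUE}

QUERY = "compute the mean of these numbers"
shortlist = filter_tools(QUERY, CATALOGUE, k=2)
selected, result = fireact(QUERY, CATALOGUE, toy_reasoner, EXECUTORS, k=2)
```

Keeping the reasoner behind a plain callable means the filter can be tested without any model access, and the fallback to the top retrieved candidate keeps the pipeline well-defined if the reasoner returns a name outside the shortlist.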

System Modules

Semantic Filter (Retrieval & Selection)

Retrieve a manageable subset of candidate tools from a large catalogue

Model or implementation: text-embedding-004

Reasoner / Policy (Retrieval & Selection)

Select the best tool from the candidate set based on query and tool descriptions

Model or implementation: Gemini 2.0 Flash

Novel Architectural Elements

Integration of semantic action embeddings directly into the Linear Bandit feature space (SC-LinUCB)
FiReAct topology: strictly separating large-scale semantic filtering from reasoning to handle 10k+ tool spaces
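
A minimal sketch of the first element follows: a LinUCB variant with a single weight vector shared across all tools, fed with action features. It makes several simplifying assumptions: the query is held fixed so each feature vector is just a toy tool embedding, the embeddings are hand-made values, and the class and method names are hypothetical. The paper's SC-LinUCB pseudocode may differ in how query and tool features are combined.

```python
import math
from typing import Dict, List

Vector = List[float]

class SharedLinUCB:
    """LinUCB with one weight vector shared by all actions."""

    def __init__(self, dim: int, alpha: float = 0.3, ridge: float = 1.0):
        self.alpha = alpha
        # Inverse of the regularised design matrix A = ridge * I.
        self.A_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _matvec(self, x: Vector) -> Vector:
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.A_inv]

    def predict(self, x: Vector) -> float:
        """Estimated reward theta^T x, with theta = A^{-1} b."""
        theta = self._matvec(self.b)
        return sum(t * xi for t, xi in zip(theta, x))

    def ucb(self, x: Vector) -> float:
        A_inv_x = self._matvec(x)
        bonus = self.alpha * math.sqrt(sum(xi * ai for xi, ai in zip(x, A_inv_x)))
        return self.predict(x) + bonus

    def select(self, features: Dict[str, Vector]) -> str:
        """Pick the currently available tool with the highest upper bound."""
        return max(features, key=lambda name: self.ucb(features[name]))

    def update(self, x: Vector, reward: float) -> None:
        # Sherman-Morrison rank-one update of A^{-1} after A += x x^T.
        A_inv_x = self._matvec(x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, A_inv_x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= A_inv_x[i] * A_inv_x[j] / denom
        for i in range(n):
            self.b[i] += reward * x[i]

# Hypothetical toy embeddings of tool descriptions.
EMB = {
    "data_analyzer": [0.9, 0.4, 0.1],
    "stats_calculator": [0.8, 0.5, 0.2],
    "image_resizer": [0.1, 0.2, 0.95],
}

agent = SharedLinUCB(dim=3, alpha=0.3)
for _ in range(20):
    agent.update(EMB["data_analyzer"], 1.0)   # useful tool
    agent.update(EMB["image_resizer"], 0.0)   # unhelpful tool

# Toolset change: data_analyzer is removed, stats_calculator is added.
available = {name: EMB[name] for name in ("stats_calculator", "image_resizer")}
zero_shot_estimate = agent.predict(EMB["stats_calculator"])
choice = agent.select(available)
```

Because the statistics are accumulated in feature space rather than per tool index, the never-tried `stats_calculator` inherits a high estimate from `data_analyzer` in this toy run. With one-hot features the same update would leave the new tool's estimate at zero.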

Modeling

Base Model: Gemini 2.0 Flash (for LLM experiments)

Training Method: Contextual Bandits (SC-LinUCB) and In-Context Learning (LLM)

Objective Functions:

Purpose: Minimize Cumulative Regret.

Formally: R_T = sum_{t=1}^{T} [ max_{a in A_t} p_eff(a, q_t) - p_eff(a_t, q_t) ]
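
As a worked example of this objective, the sketch below accumulates per-round regret from a table of p_eff values. The probabilities and tool names are invented for illustration, and the function name is hypothetical.

```python
from typing import Dict, List

def cumulative_regret(p_eff: List[Dict[str, float]], chosen: List[str]) -> List[float]:
    """Running sum of (best achievable p_eff - p_eff of the chosen tool).

    p_eff[t][a] is the value of tool a for the query at round t;
    chosen[t] is the tool the agent picked at round t.
    """
    total = 0.0
    curve = []
    for probs, a_t in zip(p_eff, chosen):
        total += max(probs.values()) - probs[a_t]
        curve.append(total)
    return curve

# Three rounds with two tools (made-up values).
rounds = [
    {"tool_a": 0.9, "tool_b": 0.5},
    {"tool_a": 0.2, "tool_b": 0.7},
    {"tool_a": 0.9, "tool_b": 0.5},
]
picks = ["tool_b", "tool_a", "tool_a"]  # wrong, wrong, optimal
regret_curve = cumulative_regret(rounds, picks)
```

Here the agent loses 0.4 in the first round and 0.5 in the second, then plays optimally, so the curve flattens at about 0.9. A flat regret curve is what a converged policy looks like in the paper's plots.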

Key Hyperparameters:

alpha: 0.3 (SC-LinUCB static), 0.5 (SC-LinUCB dynamic)
temperature: 0.5 (LLM)
max_output_tokens: 500-1500 (LLM)

Compute: Experiments run on Colab free tier CPU

Comparison to Prior Work

vs. LinUCB: SC-LinUCB uses dense semantic features instead of orthogonal one-hot vectors, enabling transfer learning across tools
vs. Tool-RAG: Explicitly formalizes the 'semantic context' benefit theoretically via bandit regret bounds rather than just empirical tuning
vs. HAMMER: Focuses on the reasoning/selection stage with semantic context rather than just masking/suppression
vs. ADAPT [not cited in paper]: ADAPT decomposes tasks for planning; this paper focuses on the atomic tool selection step with dynamic action spaces

Limitations

Theoretical regret bounds for non-stationary toolsets (dynamic A_t) are not formally derived, only empirically demonstrated
Depends on the quality of tool descriptions; poor descriptions may degrade performance
LLM experiments are specific to Gemini 2.0 Flash and prompt strategies; in-context learning guarantees are lacking

Reproducibility

Paper: https://arxiv.org/pdf/2507.10820.pdf

Paper includes pseudocode for SC-LinUCB and FiReAct. Experimental details (query types, tool archetypes, phase structures) are described in appendices. Code URL not explicitly provided in abstract.

📊 Experiments & Results

Evaluation Setup

Contextual Bandit simulation and LLM In-Context Learning benchmarks

Benchmarks:

Synthetic Bandit Environment (Tool Selection) [New]
XLAM-based Benchmark (Tool Retrieval & Selection) [New]

Metrics:

Cumulative Regret
Average Return / Cumulative Reward
Tool Selection Accuracy (Top-1)

Statistical methodology: Results are averaged over 15 runs (bandits) or 5-7 trials (LLM). Shaded regions in plots indicate standard error.
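
The averaging described here can be sketched as follows: for each time step, take the mean across runs and the standard error (sample standard deviation divided by the square root of the number of runs). The run data below are made up, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple

def mean_and_standard_error(runs: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-time-step mean and standard error across runs.

    runs[i][t] is the metric (e.g. cumulative regret) of run i at step t.
    """
    n_runs = len(runs)
    means, errors = [], []
    for step_values in zip(*runs):  # iterate over time steps
        means.append(mean(step_values))
        errors.append(stdev(step_values) / math.sqrt(n_runs))
    return means, errors

# Three made-up runs of three steps each.
toy_runs = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 5.0],
    [3.0, 4.0, 7.0],
]
avg_curve, se_curve = mean_and_standard_error(toy_runs)
```

The shaded band in a plot would then span `avg_curve[t] - se_curve[t]` to `avg_curve[t] + se_curve[t]` at each step.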

Key Results

Bandit experiments show SC-LinUCB (semantic) substantially outperforms LinUCB-OneHot (non-semantic) in both static and dynamic settings.

Benchmark	Metric	Baseline (LinUCB-OneHot)	This Paper (SC-LinUCB)	Δ
Synthetic Multi-Context	Cumulative Regret (log scale)	>1000	~100	at least ~900 lower
Synthetic Continual Adaptation	Cumulative Regret at T=10000	>1000	~20	at least ~980 lower

LLM experiments (Gemini 2.0 Flash) confirm that semantic context (Names + Descriptions) is critical for in-context learning.

Benchmark	Metric	Baseline	This Paper	Δ
LLM Tool Selection (Static)	Cumulative Reward	Low (near random)	High (near optimal)	Significant qualitative gap
XLAM-based (10,000 tools)	Tool Selection Accuracy	~75% (retrieval only)	~90% (FiReAct)	~+15 percentage points

Experiment Figures

Cumulative Regret over time for SC-LinUCB vs LinUCB-OneHot in a dynamic environment with 4 phases (tools/queries changing).

Tool selection accuracy vs. number of total tools (up to 10,000), comparing retrieval only vs. FiReAct reasoning.

Main Takeaways

Semantic Context allows agents to generalize utility across tools: learning one tool informs the value of semantically similar tools.
In dynamic environments (tools added/removed), semantic agents adapt instantly (zero-shot transfer), while non-semantic agents suffer catastrophic forgetting.
For LLMs, 'Name Only' context can sometimes outperform full descriptions in highly dynamic settings due to reduced cognitive load, though 'Name+Description' is generally robust.
Filtering alone is insufficient for large toolsets; a reasoning step (FiReAct) significantly boosts accuracy by re-ranking retrieved candidates.

📚 Prerequisite Knowledge

Prerequisites

Contextual Bandits (LinUCB)
Reinforcement Learning basics (Regret, Return)
Large Language Models (In-context learning)
Retrieval-Augmented Generation (RAG)

Key Terms

Semantic Context (SC): The collection of natural language descriptions (e.g., docstrings) for all currently available tools, used to represent actions

SC-LinUCB: A variation of the LinUCB bandit algorithm that uses semantic embeddings of tool descriptions as action features

Regret: The difference between the total reward the agent could have gotten by acting optimally and the reward it actually received

Catastrophic Forgetting: The tendency of a neural network or learning algorithm to completely lose previously learned knowledge when learning new information

FiReAct (Filter-Reason-Act): A pipeline proposed in this paper that filters a large toolset via retrieval before using an LLM to reason and select the final tool

One-hot encoding: A representation where each item (tool) is a vector with a single '1' and all other zeros; implies no shared meaning between items

LinUCB (Linear Upper Confidence Bound): A bandit algorithm that assumes rewards are a linear function of context features and selects actions to maximize an upper confidence bound on the reward

In-context learning (ICL): The ability of LLMs to learn tasks from examples or instructions provided in the prompt without parameter updates

Effective noise: A term in regret bounds summarizing observation noise and model approximation error; lower effective noise implies faster learning