TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools

📝 Paper Summary

Multi-call tool use with flexible plan
Agentic RAG pipeline

TxAgent is a fine-tuned LLM agent that personalizes therapeutic decisions by iteratively retrieving knowledge from 211 verified biomedical tools and generating transparent reasoning traces.

Core Problem

General-purpose LLMs lack real-time access to updated biomedical knowledge, hallucinate medical facts, and struggle with multi-step reasoning required for personalized treatment plans.

Why it matters:

Retraining LLMs for new drugs (like those approved in 2024) is computationally expensive and suffers from catastrophic forgetting
Precision medicine requires integrating diverse factors (genetics, comorbidities, drug interactions) that single-step RAG cannot handle effectively
Existing tool-use models fail to manage large tool spaces (hundreds of tools) and often cannot recover from failed initial tool calls

Concrete Example: When asked for the dosage of 'Kisunla' (approved in 2024, post-training cutoff), a standard LLM hallucinates or pleads ignorance. TxAgent recognizes the knowledge gap, calls 'get_dosage', retrieves FDA records, and synthesizes a correct answer.

Key Novelty

TxAgent: Full-stack agent optimization for therapeutic reasoning

Combines a specialized retrieval module (ToolRAG) to dynamically select from 211 tools with a fine-tuned reasoning agent that generates step-by-step plans
Uses a multi-agent data synthesis pipeline (QuestionGen, TraceGen) to create a massive instruction-tuning dataset (378k samples) derived from FDA labels and knowledge graphs

Architecture

Overview of the TxAgent framework including the ToolUniverse, ToolRAG retrieval, and the iterative reasoning loop.

Evaluation Highlights

92.1% accuracy on open-ended DrugPC reasoning tasks, outperforming GPT-4o by 25.8 percentage points and Llama-3.1-70B-Instruct by 39.3 percentage points
Achieves variance of <0.01 when testing across brand names, generic names, and descriptions, compared to GPT-4o's variance of 9.96
Outperforms DeepSeek-R1 (671B) by 7.5 percentage points on the open-ended TreatmentPC personalized medicine benchmark despite being an 8B-parameter model

Breakthrough Assessment

8/10

Significant advance in domain-specific agentic reasoning. Effectively solves the 'too many tools' problem via retrieval and demonstrates superior generalization across drug terminologies compared to frontier models.

⚙️ Technical Details

Problem Definition

Setting: Therapeutic Question Answering with Tool Use

Inputs: Natural language therapeutic question (e.g., treatment recommendation, drug interaction check)

Outputs: Answer grounded in external evidence with a multi-step reasoning trace

Pipeline Flow

User Query -> TxAgent (LLM) -> ToolRAG (Tool Retrieval) -> Tool Selection -> External Tool Execution -> Observation -> TxAgent (Reasoning Update) -> [repeat from ToolRAG until the agent is ready to answer] -> Final Answer
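
The flow above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `fake_llm`, `retrieve_tools`, the stubbed `get_dosage`, and the action dictionary format are hypothetical stand-ins. Only the order of operations (retrieve tools, call a tool, record the observation, update reasoning, finish through GiveAnswer) follows this summary.

```python
from typing import Callable, Dict, List

# Hypothetical tool: a real system would query an API such as openFDA.
def get_dosage(drug: str) -> str:
    records = {"Kisunla": "dosage text from a (stubbed) FDA label"}
    return records.get(drug, "no record found")

TOOLS: Dict[str, Callable[[str], str]] = {"get_dosage": get_dosage}

def retrieve_tools(intent: str) -> List[str]:
    """Stand-in for ToolRAG: keyword match instead of a learned retriever."""
    return [name for name in TOOLS
            if "dosage" in intent.lower() and "dosage" in name]

def fake_llm(question: str, observations: List[str], available: List[str]) -> dict:
    """Stand-in for the fine-tuned model: choose the next action."""
    if not observations and available:
        return {"tool": available[0], "arg": "Kisunla"}
    evidence = observations[-1] if observations else "no evidence"
    return {"tool": "GiveAnswer", "arg": f"Answer based on: {evidence}"}

def run_agent(question: str, max_steps: int = 5) -> dict:
    observations: List[str] = []
    trace: List[dict] = []
    for _ in range(max_steps):
        available = retrieve_tools(question)          # tool retrieval
        action = fake_llm(question, observations, available)
        trace.append(action)                          # keep the reasoning trace
        if action["tool"] == "GiveAnswer":            # terminating tool call
            return {"answer": action["arg"], "trace": trace}
        observations.append(TOOLS[action["tool"]](action["arg"]))
    return {"answer": None, "trace": trace}           # step budget exhausted

result = run_agent("What is the dosage of Kisunla?")
```

The step budget (`max_steps`) is an added safeguard for the sketch; the summary does not say how the real agent bounds its loop.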

System Modules

TxAgent LLM

Orchestrates reasoning, generates thoughts, and decides which tools to call

Model or implementation: Llama-3.1-8B-Instruct (Fine-tuned)

ToolRAG

Retrieves the most relevant tool definitions from the ToolUniverse based on the agent's intent

Model or implementation: ML-based retrieval system (specific architecture not detailed, likely dense retriever)
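
Since the retriever's architecture is not detailed here, the following is only a toy stand-in for the idea of ranking tool descriptions against the agent's stated intent. It swaps in bag-of-words cosine similarity where a real system would likely use learned embeddings. The tool names other than `get_dosage` and all descriptions are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical tool descriptions; only `get_dosage` is named in this summary.
TOOL_DESCRIPTIONS: Dict[str, str] = {
    "get_dosage": "return dosage and administration information for a drug",
    "get_drug_interactions": "return known interactions between a drug and other drugs",
    "get_disease_targets": "return therapeutic targets associated with a disease",
}

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_tools(intent: str, k: int = 2) -> List[str]:
    """Return the k tool names whose descriptions best match the intent."""
    query = embed(intent)
    ranked = sorted(
        TOOL_DESCRIPTIONS,
        key=lambda name: cosine(query, embed(TOOL_DESCRIPTIONS[name])),
        reverse=True,
    )
    return ranked[:k]

top = retrieve_tools("find dosage information for a drug", k=1)
```

Only the top-k definitions would be placed in the model's context, which is what lets the toolset grow without the prompt growing with it.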

ToolUniverse

Executes API calls to sources like openFDA and Open Targets

Model or implementation: Collection of 211 Python functions wrapping APIs
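
One way to picture "Python functions wrapping APIs" is a name-to-function registry with a uniform call interface. The sketch below is an assumption for illustration: `register_tool`, `call_tool`, the result dictionary shape, and the stubbed label store are hypothetical, and no real API is contacted.

```python
from typing import Callable, Dict

TOOL_REGISTRY: Dict[str, dict] = {}

def register_tool(description: str) -> Callable:
    """Decorator that records a function and its description in the registry."""
    def wrap(fn: Callable) -> Callable:
        TOOL_REGISTRY[fn.__name__] = {"description": description, "fn": fn}
        return fn
    return wrap

# Stubbed backend standing in for a remote API response.
_FAKE_LABELS = {"exampledrug": {"dosage": "stubbed dosage text"}}

@register_tool("Return the dosage section of a drug label.")
def get_dosage(drug_name: str) -> dict:
    label = _FAKE_LABELS.get(drug_name.lower())
    if label is None:
        return {"ok": False, "error": f"no label found for {drug_name}"}
    return {"ok": True, "dosage": label["dosage"]}

def call_tool(name: str, **kwargs) -> dict:
    """Dispatch a tool call by name; unknown tools return a structured error."""
    entry = TOOL_REGISTRY.get(name)
    if entry is None:
        return {"ok": False, "error": f"unknown tool {name}"}
    return entry["fn"](**kwargs)

found = call_tool("get_dosage", drug_name="ExampleDrug")
missing = call_tool("get_dosage", drug_name="NotADrug")
```

Returning a structured error rather than raising gives the agent an observation it can react to, for example by retrying with a different name.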

Novel Architectural Elements

Integration of ToolRAG directly into the agent's loop to dynamically swap the available toolset during inference, allowing scale to 211+ tools

Modeling

Base Model: Llama-3.1-8B-Instruct

Training Method: Supervised Fine-Tuning (Instruction Tuning)

Trainable Parameters: Full model (implied via standard fine-tuning description)

Training Data:

TxAgent-Instruct: 378,027 samples total
Derived from 85,340 multi-step reasoning traces
Generated by TraceGen multi-agent system (Helper, Tool Provider, Solver agents)
Data sources: FDA labels (since 1939), PrimeKG, Open Targets
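
The two counts above imply roughly 4.4 training samples per reasoning trace. The summary does not say how traces are split into samples; the sketch below assumes one plausible rule (one sample per step, conditioned on all earlier steps) purely for illustration, with a hypothetical three-step trace.

```python
from typing import Dict, List

def expand_trace(question: str, steps: List[str]) -> List[Dict[str, object]]:
    """Turn one multi-step trace into one training sample per step.

    Each sample's input is the question plus all earlier steps; its target
    is the next step. This splitting rule is an assumption, not the paper's.
    """
    return [
        {"question": question, "history": steps[:i], "target": steps[i]}
        for i in range(len(steps))
    ]

# Hypothetical three-step trace.
samples = expand_trace(
    "What is the dosage of drug X?",
    ["call get_dosage", "read observation", "call GiveAnswer"],
)

# Figures from this summary: average number of samples per trace.
avg_samples_per_trace = 378_027 / 85_340
```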

Key Hyperparameters:

base_model: Llama-3.1-8B-Instruct

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolACE/WattTool: TxAgent uses ToolRAG to handle a large toolset (211 tools, more than fit in a limited context window) and is explicitly trained for multi-step iterative reasoning rather than single-turn tool calling
vs. DeepSeek-R1: TxAgent grounds reasoning in real-time tool outputs rather than internal weights, reducing hallucinations on new drugs
vs. GPT-4o: TxAgent is specialized for biomedical terminology, showing negligible variance across brand names, generic names, and descriptions, where GPT-4o falters

Limitations

Dependency on external API availability and uptime (openFDA, Open Targets)
Performance on non-FDA-approved or experimental drugs not covered by the connected APIs is untested
Reliance on the quality of the synthetic training data (TraceGen outputs)

Reproducibility

Code availability is not explicitly provided in the text. The paper mentions datasets (TxAgent-Instruct) and benchmarks (DrugPC, etc.) but does not provide a repository URL. ToolUniverse relies on public APIs (openFDA, Open Targets).

📊 Experiments & Results

Evaluation Setup

Evaluated on 5 new benchmarks covering drug reasoning and personalized treatment, using drugs approved in 2024 (unseen during training).

Benchmarks:

DrugPC (drug reasoning across 11 tasks such as dosage and safety) [New]
BrandPC/GenericPC/DescriptionPC (Robustness to naming variations and descriptions) [New]
TreatmentPC (Personalized treatment recommendation) [New]

Metrics:

Accuracy (Multiple-choice)
Accuracy (Open-ended generation followed by choice selection)
Statistical methodology: Variance reported for robustness checks; no formal significance tests reported.
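
The robustness metric can be illustrated with a worked example. The accuracies below are hypothetical, and the summary does not state whether population or sample variance is used, so both are shown.

```python
import statistics

# Hypothetical accuracies (%) on BrandPC, GenericPC, DescriptionPC.
stable_model = [90.0, 90.1, 89.9]
unstable_model = [70.0, 74.0, 66.0]

def spread(accuracies):
    """Return (population variance, sample variance) of the accuracies."""
    return statistics.pvariance(accuracies), statistics.variance(accuracies)

stable_pvar, stable_var = spread(stable_model)
unstable_pvar, unstable_var = spread(unstable_model)
```

A model whose accuracy barely moves when the drug is named differently has a variance near zero, while a swing of a few points in either direction already yields a variance above 10.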

Key Results

Performance on DrugPC (general drug reasoning) shows TxAgent outperforming all baselines, particularly in open-ended settings.

Benchmark	Metric	Baseline	This Paper	Δ
DrugPC (Open-ended)	Accuracy (%)	66.3 (GPT-4o)	92.1	+25.8
DrugPC (Multiple-choice)	Accuracy (%)	76.4	93.8	+17.4

Robustness experiments across drug naming conventions (Brand, Generic, Description) reveal TxAgent's stability compared to baselines.

Benchmark	Metric	Baseline	This Paper	Δ
BrandPC/GenericPC/DescriptionPC	Variance	9.96 (GPT-4o)	0.01	-9.95
DescriptionPC	Accuracy (%) (Drug ID + Answer)	48.2	56.5	+8.3

Personalized treatment results (TreatmentPC) demonstrate superiority over reasoning-specialized models.

Benchmark	Metric	Baseline	This Paper	Δ
TreatmentPC (Open-ended)	Accuracy (%)	67.5 (DeepSeek-R1)	75.0	+7.5
TreatmentPC (Open-ended)	Accuracy (%)	49.6	75.0	+25.4
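
The Δ column is the absolute difference between the two scores (percentage points for accuracy rows). A quick consistency check over the reported values, using only numbers from the table above:

```python
# (benchmark, baseline, this paper, reported delta) as listed in the table.
rows = [
    ("DrugPC (Open-ended)", 66.3, 92.1, 25.8),
    ("DrugPC (Multiple-choice)", 76.4, 93.8, 17.4),
    ("BrandPC/GenericPC/DescriptionPC variance", 9.96, 0.01, -9.95),
    ("DescriptionPC", 48.2, 56.5, 8.3),
    ("TreatmentPC (Open-ended), first baseline", 67.5, 75.0, 7.5),
    ("TreatmentPC (Open-ended), second baseline", 49.6, 75.0, 25.4),
]

def delta(baseline: float, ours: float) -> float:
    """Absolute difference, rounded to absorb floating-point noise."""
    return round(ours - baseline, 2)

# Collect any rows whose reported delta disagrees with the recomputed one.
mismatches = [name for name, base, ours, reported in rows
              if delta(base, ours) != reported]
```

An empty `mismatches` list means every reported Δ matches its row.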

Experiment Figures

Radar charts and bar plots showing accuracy across 11 DrugPC tasks.

Performance on TreatmentPC and case studies of reasoning traces.

Main Takeaways

Tool-augmented reasoning significantly outperforms pure LLM reasoning for biomedical tasks, especially for recent knowledge (2024 drugs).
Goal-oriented tool selection via ToolRAG allows the agent to effectively utilize a large toolbox (211 tools) where standard tool-use models fail due to context limits.
Fine-tuning on multi-step reasoning traces (TxAgent-Instruct) enables the model to recover from failed tool calls and refine queries, unlike baseline tool-use models that often fail after single attempts.
The approach generalizes exceptionally well across different drug representations (brand vs. generic), addressing a known fragility in general-purpose LLMs.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of LLM tool use / function calling
Knowledge of Retrieval-Augmented Generation (RAG)
Familiarity with biomedical entities (drugs, phenotypes, targets)

Key Terms

ToolRAG: A retrieval module within TxAgent that selects relevant tools from a large library based on the current reasoning context

TxAgent-Instruct: A synthetic dataset of 378,027 samples used to fine-tune the agent, containing reasoning traces and tool calls

openFDA: An API providing access to public FDA data, including drug labels and adverse event reports

Open Targets: A platform for therapeutic target identification and validation, linking diseases to targets

DrugPC: A new benchmark evaluating drug reasoning across 11 tasks like dosage, warnings, and pharmacology

TreatmentPC: A new benchmark assessing personalized treatment recommendations considering patient specifics

Instruction Tuning: Fine-tuning a pre-trained language model on a dataset of (instruction, output) pairs to improve its ability to follow commands

GiveAnswer: A special tool implemented in TxAgent that the model must call to finalize its reasoning process and deliver the answer