SciAgent: Tool-augmented Language Models for Scientific Reasoning

📝 Paper Summary

Tool-augmented scientific reasoning
Agentic data synthesis
Tool retrieval

SciAgent enhances scientific reasoning by shifting LLMs from omniscient solvers to proficient tool-users, trained on a large-scale synthetic corpus of math functions and evaluated on a new multi-domain benchmark.

Core Problem

Scientific reasoning requires both specialized domain knowledge and calculation skills, but annotated data is scarce and fine-tuning models for every new domain is prohibitively expensive.

Why it matters:

Current LLMs (even GPT-4) struggle with scientific reasoning, achieving only ~35-50% accuracy on benchmarks like SciBench and TheoremQA.
Purely data-driven approaches require expensive expert annotation for every new scientific field.
Existing methods lack a scalable way to teach LLMs how to use pre-existing scientific tools (functions) rather than memorizing all knowledge.

Concrete Example: To solve a physics problem about Malus' law, a model must know the specific formula and perform precise calculations. Standard LLMs often hallucinate the formula or fail the arithmetic. SciAgent retrieves a correct Python function for Malus' law from a toolset and executes it to get the exact answer.
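To make the example concrete, here is a minimal sketch of what a reusable Malus' law tool could look like. The function name, signature, and the sample numbers are illustrative assumptions, not the function from the paper's toolset; only the physical law itself (I = I0 · cos²θ) is standard.

```python
import math


def malus_law_intensity(initial_intensity: float, angle_degrees: float) -> float:
    """Transmitted intensity of polarized light through an ideal polarizer.

    Implements Malus' law, I = I0 * cos^2(theta), with theta given in degrees.
    The name and signature are hypothetical, chosen for illustration only.
    """
    theta = math.radians(angle_degrees)
    return initial_intensity * math.cos(theta) ** 2


# Illustrative call: 10 W/m^2 of polarized light, polarizer rotated by 60 degrees.
intensity = malus_law_intensity(10.0, 60.0)
print(round(intensity, 4))
```

Because the formula lives in the tool rather than in the model's weights, the model only has to pick the right function and pass the right arguments.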

Key Novelty

MathFunc Corpus & SciAgent Framework

Constructs 'MathFunc', a synthetic training corpus where GPT-4 generates reusable functions, plans, and tool-augmented solutions for math problems.
Disentangles tool creation from solution generation: uses a cross-retrieval strategy to fetch generalized functions from other problems, preventing the model from learning ad-hoc, problem-specific shortcuts.
Trains a four-stage agent (Planning, Retrieval, Action, Execution) that explicitly plans before retrieving tools, improving relevance and reasoning structure.

Architecture

The inference pipeline of SciAgent.

Evaluation Highlights

+13.4% absolute accuracy for SciAgent-Mistral-7B over comparable open-source models (e.g., Chameleon, ToolAlpaca) on the new SciToolBench.
SciAgent-DeepMath-7B outperforms ChatGPT (gpt-3.5-turbo) by a large margin on SciToolBench.
SciAgent-DeepMath-7B achieves 46.61% on SciToolBench, surpassing LLaMA-2-70B (26.05%) by 20.56 points despite being 10x smaller.

Breakthrough Assessment

7/10

Strong contribution in synthetic data generation for tool use and a solid new benchmark. The performance gains are significant for 7B models, though reliance on GPT-4 for data generation is a standard but limiting factor.

⚙️ Technical Details

Problem Definition

Setting: Tool-augmented scientific reasoning where an agent must solve a question q using a domain-specific toolset F_D

Inputs: A scientific question q and a large set of documented Python functions F_D

Outputs: A final answer a_q derived by interleaving natural language reasoning and code execution

Pipeline Flow

Planner: Generates high-level plan
Retriever: Selects relevant tools
Actor: Generates code solution using tools
Executor: Runs code to get answer
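The four stages above can be sketched as a simple function composition. This is a structural illustration only: `run_pipeline`, `run_code`, and the stub lambdas are hypothetical stand-ins for the fine-tuned LLM, the retriever, and the interpreter, and the toy question is not from the paper.

```python
from typing import Callable, List


def run_pipeline(
    question: str,
    plan_fn: Callable[[str], str],
    retrieve_fn: Callable[[str, str], List[str]],
    act_fn: Callable[[str, str, List[str]], str],
    execute_fn: Callable[[str], str],
) -> str:
    """Chain Planner -> Retriever -> Actor -> Executor for one question."""
    plan = plan_fn(question)                       # high-level plan
    tools = retrieve_fn(question, plan)            # tools selected using question + plan
    solution_code = act_fn(question, plan, tools)  # code that calls the tools
    return execute_fn(solution_code)               # run the code to get the answer


def run_code(code: str) -> str:
    """Toy executor: runs the code and returns the variable named `result`."""
    namespace: dict = {}
    exec(code, namespace)  # not sandboxed; acceptable only for this illustration
    return str(namespace["result"])


# Stubs standing in for the learned components (assumptions for illustration).
answer = run_pipeline(
    "What is 2 + 3?",
    plan_fn=lambda q: "Add the two numbers.",
    retrieve_fn=lambda q, p: ["def add(a, b): return a + b"],
    act_fn=lambda q, p, tools: tools[0] + "\nresult = add(2, 3)",
    execute_fn=run_code,
)
print(answer)
```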

System Modules

Planning Module

Decompose the problem into a high-level plan to guide retrieval

Model or implementation: Fine-tuned 7B LLM (shared weights with Action)

Retrieval Module

Retrieve top-k relevant functions from the toolset based on question and plan

Model or implementation: Dense Retriever (based on bge-base-en-v1.5)
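As a rough illustration of top-k retrieval over a toolset, the sketch below ranks tool descriptions by cosine similarity to a query built from the question and the plan. It deliberately swaps the dense encoder for a toy bag-of-words embedding so that it runs without model downloads; the function names and the three-tool toolset are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a dense encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(question: str, plan: str, tool_docs: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k tools whose descriptions best match question + plan."""
    query = embed(question + " " + plan)
    ranked = sorted(
        tool_docs,
        key=lambda name: cosine(query, embed(tool_docs[name])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toolset: tool name -> one-line description.
tool_docs = {
    "malus_law": "intensity of polarized light after a polarizer",
    "compound_interest": "future value with compound interest rate",
    "ideal_gas": "pressure of an ideal gas from volume and temperature",
}
top_tools = retrieve_top_k(
    "What intensity of polarized light passes through the polarizer?",
    "Apply the law for polarized light intensity",
    tool_docs,
)
print(top_tools)
```

Including the plan in the query is the point of planning before retrieval: the plan adds task-level vocabulary that the raw question may lack.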

Action Module

Generate the solution text interleaved with Python code blocks that call retrieved functions

Model or implementation: Fine-tuned 7B LLM (shared weights with Planning)

Executor

Execute the generated Python code to obtain the final answer

Model or implementation: Python Interpreter
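A minimal sketch of the executor step, assuming the generated solution prints its final answer to standard output. `execute_solution` is a hypothetical helper; the paper does not specify how the interpreter is invoked, and a real system would need sandboxing and timeouts that are omitted here.

```python
import contextlib
import io


def execute_solution(code: str) -> str:
    """Run generated Python code and return whatever it printed.

    Errors are returned as text instead of being raised, so the caller
    can inspect them. NOTE: exec is not sandboxed; illustration only.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as error:
        return f"Error: {type(error).__name__}: {error}"
    return buffer.getvalue().strip()


print(execute_solution("print(6 * 7)"))
print(execute_solution("print(1 / 0)"))
```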

Novel Architectural Elements

Cross-retrieval data construction: A pipeline design choice where training data is built by retrieving tools from *other* questions to force generalization, rather than using the tools generated *for* the current question.
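The core of cross-retrieval can be sketched as a filter applied before ranking: tools generated for the current question are excluded from the candidate pool. The data layout (`tools_by_question`), the scoring function, and the tool names below are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, Dict, List


def cross_retrieve(
    question_id: str,
    tools_by_question: Dict[str, List[str]],
    score: Callable[[str], float],
    k: int = 3,
) -> List[str]:
    """Retrieve top-k tools for a question, using only tools built for OTHER questions."""
    candidates = [
        tool
        for source_id, tools in tools_by_question.items()
        if source_id != question_id  # drop the question's own, possibly ad-hoc, tools
        for tool in tools
    ]
    return sorted(candidates, key=score, reverse=True)[:k]


# Hypothetical pool: which tool was generated while solving which question.
tools_by_question = {
    "q1": ["area_of_circle_radius_2"],  # ad-hoc tool written for q1 itself
    "q2": ["area_of_circle"],           # general tool from another question
    "q3": ["triangle_area"],
}


def relevance(tool_name: str) -> float:
    """Toy relevance score for a circle-area question."""
    return 1.0 if "circle" in tool_name else 0.0


retrieved = cross_retrieve("q1", tools_by_question, relevance, k=3)
print(retrieved)
```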

Modeling

Base Model: Mistral-7B, LLaMA-2-7B, DeepMath-7B

Training Method: Supervised Fine-Tuning (Imitation Learning)

Adaptation: Full fine-tuning

Trainable Parameters: All parameters (7B)

Training Data:

MathFunc corpus: 31,375 samples derived from MATH training set
Includes 6,229 tool-use samples and 24,946 tool-free PoT samples
5,981 generalized Python functions in the toolset

Key Hyperparameters:

learning_rate: 5e-6
batch_size: 16
epochs: 2
scheduler: cosine
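For readers unfamiliar with the cosine scheduler, the sketch below shows one common form of cosine decay starting from the listed learning rate of 5e-6. The summary does not state whether warmup or a minimum learning rate was used, so this assumes no warmup and decay to zero; `cosine_lr` and the step count of 100 are hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 5e-6) -> float:
    """Cosine decay from peak_lr to 0 over total_steps (assumes no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a hypothetical 100-step run.
for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100))
```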

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolAlpaca: SciAgent focuses on scientific/math tools rather than general API calls and uses a dense retriever for large toolsets.
vs. ToRA: SciAgent introduces a separate planning stage and retrieves from a large external toolset rather than just generating code.
vs. Chameleon: SciAgent fine-tunes the model to use tools autonomously rather than relying on few-shot prompting/orchestration.

Limitations

Dependency on GPT-4 for synthetic data generation adds cost and raises reproducibility concerns.
The toolset is static; the agent cannot define new tools on the fly during inference (though it can write code).
Performance still lags behind GPT-4 in some complex scenarios.

Reproducibility

The paper describes the data generation pipeline in detail (prompts in the Appendix). No link to code, data, or a project page was found in the main text.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on scientific reasoning tasks with provided domain toolsets.

Benchmarks:

SciToolBench (Scientific Question Answering with Tools) [New]
CREATOR-Challenge (subset) (Math/Tool reasoning)

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main results on the newly constructed SciToolBench across 5 domains (Math, Physics, Chemistry, EECS, Finance). SciAgent variants consistently outperform baselines of similar size.

Benchmark	Metric	Baseline	This Paper	Δ
SciToolBench	Accuracy	18.34	39.95	+21.61
SciToolBench	Accuracy	21.61	39.95	+18.34
SciToolBench	Accuracy	36.21	46.61	+10.40
SciToolBench	Accuracy	26.05	46.61	+20.56

Ablation studies on the components of SciAgent (Planning and Retrieval) using the SciToolBench-Math subset.

Benchmark	Metric	Baseline	This Paper	Δ
SciToolBench (Math subset)	Accuracy	43.33	48.33	+5.00
SciToolBench (Math subset)	Accuracy	55.00	48.33	-6.67

Experiment Figures

The automatic data construction pipeline for MathFunc.

Main Takeaways

Fine-tuning on the MathFunc corpus effectively teaches models to use tools for scientific reasoning, bridging the gap between small open-source models and proprietary giants.
The explicit planning module is crucial; it improves retrieval accuracy and overall solution quality by structuring the reasoning process before tool selection.
SciAgent shows strong generalization across scientific domains (Physics, Chemistry, etc.) even though the training corpus (MathFunc) is primarily math-derived, which validates the 'tool-user' approach.

📚 Prerequisite Knowledge

Prerequisites

Language Model fine-tuning (SFT)
Tool learning / Tool-augmented generation
Dense retrieval (for selecting relevant tools)
Synthetic data generation pipelines

Key Terms

MathFunc: The novel training corpus constructed in this paper, containing ~30k samples of questions, plans, retrieved tools, and code-integrated solutions

SciToolBench: A new benchmark dataset created by the authors covering 5 scientific domains (Math, Physics, Chemistry, EECS, Finance) with domain-specific toolsets

Ad-hoc functions: Functions generated specifically for a single problem instance (e.g., hardcoding numbers) rather than being general-purpose reusable tools

Cross-retrieval: A data construction strategy where the solution for problem A is forced to use tools generated for problem B (or others), ensuring the model learns to select generalized tools rather than cheat with ad-hoc ones

Program-of-Thought (PoT): A reasoning format where the model generates executable code (Python) as the reasoning steps instead of just text

Rationale: Natural language explanation generated before or alongside code to explain the reasoning process
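To illustrate the distinction between ad-hoc and generalized functions defined above, the sketch below contrasts the two on a made-up cylinder-volume problem. Both function names and the numbers are hypothetical; the final lines are written in Program-of-Thought style, where the reasoning steps are executable code.

```python
import math


# Ad-hoc: the numbers from one specific problem are hardcoded,
# so the function cannot be reused for any other question.
def solve_problem_17() -> float:
    return math.pi * 3 ** 2 * 5


# Generalized: the same formula exposed as a reusable, parameterized tool.
def cylinder_volume(radius: float, height: float) -> float:
    """Volume of a right circular cylinder, V = pi * r^2 * h."""
    return math.pi * radius ** 2 * height


# Program-of-Thought style solution that calls the generalized tool.
radius, height = 3.0, 5.0
volume = cylinder_volume(radius, height)
print(round(volume, 2))
```

Cross-retrieval favors the second kind of function, because a tool written for another problem is only useful if it takes its inputs as parameters.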