GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments

📝 Paper Summary

Agentic reasoning · Reasoning benchmarks

GSM-Agent isolates agentic skills by hiding GSM8K premises in a searchable database, revealing that models struggle largely because they rarely revisit information sources they have already found.

Core Problem

Current agent benchmarks entangle reasoning capabilities with domain knowledge or complex math, making it difficult to isolate 'agentic' skills like search and planning from static reasoning ability.

Why it matters:

Models like GPT-5 show high static reasoning performance but fail when required to actively search for information, indicating a deployment gap
Without clean separation, researchers cannot determine if an agent fails due to lack of knowledge, poor math skills, or inability to use tools effectively
Understanding specific agentic failure modes (like the inability to revisit nodes) is necessary to improve autonomous systems beyond simple interaction scaling

Concrete Example: In a standard GSM8K task, a model sees 'Alice bought 2 books for $5 each'. In GSM-Agent, the model sees only 'How much did Alice spend?' and must use a Search tool to find a document stating the price and quantity. Models often fail to retrieve the document despite being able to solve the math.
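The contrast in the example above can be sketched in a few lines of Python. This is an illustrative sketch only: the `Problem` class, the helper names, and the one-premise-per-document split are assumptions made for this example, not the benchmark's actual data format.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    premises: list[str]  # facts needed to solve the task
    question: str        # the final question only


def static_prompt(p: Problem) -> str:
    """Static setting: premises and question are shown together."""
    return " ".join(p.premises + [p.question])


def agentic_prompt(p: Problem) -> str:
    """Agentic setting: only the question is shown; premises must be searched for."""
    return p.question


def to_documents(p: Problem) -> list[str]:
    """Assumption for illustration: each premise becomes one searchable document."""
    return list(p.premises)


alice = Problem(
    premises=["Alice bought 2 books for $5 each."],
    question="How much did Alice spend?",
)
```

Here `static_prompt(alice)` contains the price and quantity, while `agentic_prompt(alice)` contains only the question, so the premise is reachable only through `to_documents(alice)`.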

Key Novelty

GSM-Agent Benchmark and Agentic Reasoning Graph

Transforms static math word problems into agentic tasks by stripping premises from the prompt and hiding them in a generated, searchable document database (environment)
Introduces 'Agentic Reasoning Graph', a framework that clusters document embeddings to map continuous tool usage into discrete steps (Explore, Exploit, Revisit) for analysis
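As a rough illustration of the discrete step labels, the sketch below tags each tool call once it has been mapped to a node. The exact definitions are assumptions, since the summary does not spell them out: a first visit to a node counts as Explore, staying on the same node as the previous call counts as Exploit, and returning to an earlier node after leaving it counts as Revisit. The function name `label_steps` is hypothetical.

```python
def label_steps(node_sequence):
    """Label each tool call given the node (cluster id) it landed on.

    Assumed definitions (not confirmed by the summary):
      explore: first visit to a node
      exploit: same node as the previous call
      revisit: an earlier node, reached again after leaving it
    """
    labels = []
    seen = set()
    previous = None
    for node in node_sequence:
        if node not in seen:
            labels.append("explore")
        elif node == previous:
            labels.append("exploit")
        else:
            labels.append("revisit")
        seen.add(node)
        previous = node
    return labels


trace = ["A", "A", "B", "A", "C", "B"]
labels = label_steps(trace)
```

For this trace the labels come out as explore, exploit, explore, revisit, explore, revisit.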

Architecture

Contrast between Static Reasoning (Standard GSM8K) and Agentic Reasoning (GSM-Agent)

Evaluation Highlights

Frontier model GPT-5 achieves only 67% accuracy on GSM-Agent, a drop of roughly 33 percentage points relative to its static reasoning performance
DeepSeek-V3 collapses in the agentic setting, with a reported accuracy drop of up to 80% compared to the static setting
Analysis using the Agentic Reasoning Graph reveals a strong correlation between the 'revisit ratio' (returning to a previously found document) and overall task accuracy

Breakthrough Assessment

9/10

Cleverly repurposes a solved task (GSM8K) to isolate agentic overhead. The 'revisit' insight is a significant, interpretable finding about current LLM agent limitations.

⚙️ Technical Details

Problem Definition

Setting: Agentic Question Answering with Tool Use

Inputs: Question q (without premises)

Outputs: Numerical answer a_hat

Pipeline Flow

Agent receives Question (without premises)
Agent generates Search Query
Environment returns top-5 Documents (via Embedding Retrieval)
Agent processes Documents -> repeats Search or calculates
Agent outputs Final Answer
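The pipeline steps above can be sketched as a simple loop. This is a toy sketch under stated assumptions: the bag-of-words `bow` function stands in for a real embedding model, `Environment`, `run_episode`, and `scripted_agent` are hypothetical names, and the scripted agent replaces an LLM so that the example runs on its own.

```python
import re
from collections import Counter
from math import sqrt


def bow(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class Environment:
    """Holds the documents and returns the top-k matches for a query."""

    def __init__(self, documents, top_k=5):
        self.documents = documents
        self.top_k = top_k

    def search(self, query):
        q = bow(query)
        ranked = sorted(self.documents, key=lambda d: cosine(q, bow(d)), reverse=True)
        return ranked[: self.top_k]


def run_episode(agent_step, env, question, max_rounds=10):
    """agent_step returns ('search', query) or ('answer', value)."""
    observations = []
    for _ in range(max_rounds):
        action, payload = agent_step(question, observations)
        if action == "answer":
            return payload
        observations.extend(env.search(payload))
    return None  # round budget exhausted without an answer


def scripted_agent(question, observations):
    """Stand-in for an LLM: answers once the needed premise has been retrieved."""
    for doc in observations:
        m = re.search(r"bought (\d+) books for \$(\d+) each", doc)
        if m:
            return "answer", int(m.group(1)) * int(m.group(2))
    return "search", "Alice books price"


env = Environment(["Alice bought 2 books for $5 each.", "Bob owns 3 dogs."])
answer = run_episode(scripted_agent, env, "How much did Alice spend?")
```

The `max_rounds` cap mirrors the idea of a limited interaction budget; its value here is arbitrary.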

System Modules

Agent

Orchestrates the reasoning process, decides which search queries to issue, and performs final calculation

Model or implementation: Evaluated on various models (GPT-5, DeepSeek-V3, Claude-3.5-Sonnet)

Environment (Database)

Stores context-rich documents containing the premises required to solve the task

Model or implementation: Chroma DB with text-embedding-3-large

Novel Architectural Elements

Evaluation Framework: Agentic Reasoning Graph that maps tool traces to discrete graph nodes via embedding clustering to measure 'revisit' patterns
Benchmark Design: Controlled decomposition of single static problems into distributed documents requiring multi-step retrieval
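One way to picture the mapping from tool traces to graph nodes is a greedy similarity-threshold clustering over the embeddings of retrieved documents. The summary does not state which clustering algorithm the paper uses, so the method, the `assign_nodes` name, and the `threshold` value below are all assumptions chosen for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def assign_nodes(embeddings, threshold=0.9):
    """Greedy clustering: join the first node whose representative is
    similar enough, otherwise open a new node."""
    representatives = []  # one embedding per node (the first member seen)
    node_ids = []
    for emb in embeddings:
        for node_id, rep in enumerate(representatives):
            if cosine(emb, rep) >= threshold:
                node_ids.append(node_id)
                break
        else:
            representatives.append(emb)
            node_ids.append(len(representatives) - 1)
    return node_ids


# Toy 2-D embeddings for four consecutive tool calls.
trace_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [1.0, 0.02]]
node_ids = assign_nodes(trace_embeddings)
```

In this toy trace the fourth call lands back on the first node, which is the kind of pattern the graph is meant to expose. A lower `threshold` merges more documents into one node and therefore changes how many steps count as returns to an earlier node.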

Modeling

Base Model: Evaluated on GPT-5, DeepSeek-V3, Claude-3.5-Sonnet

Training Method: Inference-only evaluation (with proposed tool-augmented test-time scaling)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Li et al.: GSM-Agent focuses on end-to-end reasoning and tool use success, not just question generation
vs. Zhou et al.: GSM-Agent uses the *same* underlying math tasks for both static and agentic settings, allowing a precise apples-to-apples comparison of the 'agentic gap'
vs. ToolBench [not cited in paper]: GSM-Agent focuses on specific reasoning patterns (revisit) rather than just broad tool-use success rates

Limitations

Evaluation relies on proprietary models (GPT-5) which may not be accessible to all researchers
Reasoning graph construction depends on embedding quality; poor embeddings could misclassify 'revisits'
Focus is strictly on grade-school math; findings on 'revisit' importance might vary for more creative or open-ended tasks

Reproducibility

Code: https://github.com/GuoTianYu2000/GSM-Agent

Benchmark code and data construction scripts are publicly available at https://github.com/GuoTianYu2000/GSM-Agent. The reported GPT-5 results depend on a proprietary model that may not be accessible to all researchers, which limits exact reproduction of the numbers reported for that model.

📊 Experiments & Results

Evaluation Setup

LLM agents solve modified GSM8K problems by searching a vector database for missing premises.

Benchmarks:

GSM-Agent-Full (Agentic Math Reasoning) [New]
GSM-Agent-Medium (Agentic Math Reasoning) [New]
GSM-Agent-Small (Agentic Math Reasoning) [New]

Metrics:

Accuracy (Exact Match of numerical answer)
Revisit Ratio (proportion of tool calls returning to a previously visited node)
Statistical methodology: Not explicitly reported in the paper
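The two metrics above can be computed along the following lines. This is a sketch with assumptions stated explicitly: exact match is implemented as numeric equality within a small tolerance, a missing answer (`None`) counts as wrong, and a tool call counts as a revisit only when it lands on a node seen earlier that differs from the node of the immediately preceding call. The function names are hypothetical.

```python
def exact_match_accuracy(predictions, references, tol=1e-6):
    """Fraction of predictions numerically equal to the reference answer."""
    correct = sum(
        1
        for p, r in zip(predictions, references)
        if p is not None and abs(float(p) - float(r)) <= tol
    )
    return correct / len(references) if references else 0.0


def revisit_ratio(node_sequence):
    """Share of tool calls that return to a previously visited node
    (assumed: staying on the same node as the previous call does not count)."""
    seen, previous, revisits = set(), None, 0
    for node in node_sequence:
        if node in seen and node != previous:
            revisits += 1
        seen.add(node)
        previous = node
    return revisits / len(node_sequence) if node_sequence else 0.0


acc = exact_match_accuracy([10, 7, None], [10, 8, 3])
ratio = revisit_ratio(["A", "A", "B", "A", "C", "B"])
```

In the toy inputs, one of three answers matches and two of six tool calls are revisits, so both values equal one third.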

Key Results

Performance gap analysis highlights the degradation of reasoning capabilities when moving from a static context (all info present) to an agentic context (info must be searched).
Benchmark	Metric	Baseline	This Paper	Δ
GSM-Agent	Accuracy	100.0	67.0	-33.0
GSM-Agent	Accuracy Drop	0.0	-80.0	-80.0

Experiment Figures

Dataset Construction Pipeline

Main Takeaways

Agentic reasoning is significantly harder than static reasoning: even frontier models like GPT-5 lose roughly 33 percentage points of accuracy when they must search for premises that they can easily reason over once given.
The 'Revisit' pattern (returning to a document to verify or re-read) is the strongest predictor of success in agentic tasks, yet is often missing in current models.
Tool-augmented test-time scaling (adding tools to encourage revisiting) outperforms simple interaction-round scaling (just giving the agent more turns).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Agentic vs. Static reasoning
Familiarity with GSM8K (Grade School Math 8K) dataset
Basic knowledge of vector embeddings and clustering

Key Terms

Static reasoning: Solving a problem where all necessary information (premises) is provided immediately in the prompt

Agentic reasoning: Solving a problem where information is missing and must be actively retrieved using tools (search, browsing) before the answer can be derived

Agentic Reasoning Graph: A topological representation of an agent's search history, created by clustering document embeddings into nodes and mapping tool calls to these nodes

Revisit: The action of an agent returning to a specific information node (cluster of documents) it has previously accessed

GSM8K: Grade School Math 8K, a standard benchmark of roughly 8.5K high-quality grade school math word problems

Frontier models: Leading state-of-the-art Large Language Models (e.g., GPT-5, Claude-3.5-Sonnet) evaluated in the paper