Evaluation Setup
LLM agents solve modified GSM8K problems by searching a vector database for missing premises.
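As a rough illustration of what "searching a vector database for missing premises" involves, the toy sketch below ranks a few premise documents against a query by cosine similarity. Everything in it is an assumption made for illustration: the bag-of-words `embed` function stands in for whatever embedding model the benchmark actually uses, and the names `embed`, `cosine`, `search`, and the sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" (assumption); a real setup would use a learned model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, documents: List[str], top_k: int = 1) -> List[str]:
    # Return the top_k documents most similar to the query.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


# Hypothetical premise store: each premise of a problem lives in its own document.
documents = [
    "Alice bought 3 apples on Monday",
    "Alice bought 5 apples on Tuesday",
    "Bob sold 2 pens on Monday",
]

top_hit = search("apples Alice bought Monday", documents)[0]
```

In the agentic setting, the model has to issue queries like this itself, one tool call at a time, until it has gathered every premise it needs to compute the answer.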
Benchmarks:
- GSM-Agent-Full (Agentic Math Reasoning) [New]
- GSM-Agent-Medium (Agentic Math Reasoning) [New]
- GSM-Agent-Small (Agentic Math Reasoning) [New]
Metrics:
- Accuracy (Exact Match of numerical answer)
- Revisit Ratio (proportion of tool calls returning to a previously visited node)
- Statistical methodology: Not explicitly reported in the paper
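
The two metrics above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's evaluation code: the answer normalization in `to_number` (stripping commas and a leading "$") is assumed, and each tool call is assumed to have already been mapped to a node identifier. The function names are hypothetical.

```python
from typing import Hashable, Sequence


def to_number(text: str) -> float:
    # Assumed normalization: drop whitespace, thousands separators, and a leading "$".
    return float(text.strip().replace(",", "").lstrip("$"))


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions whose numeric value equals the reference value."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        try:
            if to_number(pred) == to_number(ref):
                correct += 1
        except ValueError:
            # An answer that cannot be parsed as a number counts as wrong.
            pass
    return correct / len(references)


def revisit_ratio(visited_nodes: Sequence[Hashable]) -> float:
    """Proportion of tool calls that return to a node visited earlier in the trace."""
    seen = set()
    revisits = 0
    for node in visited_nodes:
        if node in seen:
            revisits += 1
        else:
            seen.add(node)
    return revisits / len(visited_nodes) if visited_nodes else 0.0
```

For example, a trace that visits nodes `a, b, a, c, b` contains two revisits in five tool calls, so `revisit_ratio(["a", "b", "a", "c", "b"])` returns `0.4`.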
Key Results
Performance gap analysis highlights the degradation of reasoning capabilities when moving from a static context (all info present) to an agentic context (info must be searched).

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| GSM-Agent | Accuracy | 100.0 | 67.0 | -33.0 |
| GSM-Agent | Accuracy Drop | 0.0 | -80.0 | -80.0 |
Main Takeaways
- Agentic reasoning is significantly harder than static reasoning: even frontier models like GPT-5 lose about 33 percentage points of accuracy (100.0 to 67.0) when they must search for premises that they handle easily when those premises are given in context.
- The 'Revisit' pattern (returning to a previously visited document to verify or re-read it) is the strongest predictor of success in agentic tasks, yet current models often fail to exhibit it.
- Tool-augmented test-time scaling (adding tools to encourage revisiting) outperforms simple interaction-round scaling (just giving the agent more turns).