Evaluation Setup
TAPO is evaluated on both internal datasets (TAPO-easy and TAPO-hard) and public benchmarks (MATH, GPQA Diamond, GSM8K, NQ).
Benchmarks:
- TAPO-easy-60K (Mixed Math & Fact Reasoning) [New]
- TAPO-hard-18K (Complex Math & Multi-hop Reasoning) [New]
- MATH (Mathematical Reasoning)
- GPQA Diamond (Graduate-Level Reasoning)
- GSM8K (Grade School Math)
- NQ (Open-Domain Question Answering)
Metrics:
- Accuracy / Pass Rate
- Average Tool Calls (Efficiency)
- Statistical methodology: Not explicitly reported in the paper
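As a rough illustration of how the two reported metrics could be computed from per-example evaluation logs, here is a minimal sketch. The `EvalRecord` type, its fields, and the `summarize` helper are hypothetical names introduced for this example; the paper's actual evaluation harness and answer-matching rules are not described here, and the sample records are made up.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated example (hypothetical schema, not from the paper)."""
    correct: bool    # whether the final answer matched the reference
    tool_calls: int  # number of tool invocations made while answering


def summarize(records):
    """Return accuracy and average tool calls over a list of records."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    accuracy = sum(r.correct for r in records) / n
    avg_tool_calls = sum(r.tool_calls for r in records) / n
    return {"accuracy": accuracy, "avg_tool_calls": avg_tool_calls}


# Made-up records, for illustration only.
records = [
    EvalRecord(correct=True, tool_calls=1),
    EvalRecord(correct=False, tool_calls=3),
    EvalRecord(correct=True, tool_calls=0),
    EvalRecord(correct=True, tool_calls=2),
]
summary = summarize(records)
print(summary)
```

Reporting the two numbers together matters because a gain in accuracy can be bought with many more tool calls; the efficiency metric makes that trade-off visible.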
Main Takeaways
- TAPO improves accuracy on both knowledge-intensive and computational tasks relative to baselines (the paper does not report statistical significance tests).
- The method mitigates reward hacking: baselines often increase tool usage without accuracy gains, whereas TAPO keeps tool use efficient.
- Generalization is improved: models trained with TAPO do not suffer catastrophic forgetting on standard math benchmarks while gaining search capabilities.
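The reward-hacking pattern named in the takeaways (more tool calls without a matching accuracy gain) can be checked mechanically when comparing two models. The sketch below is one possible check, not the paper's procedure: the function name `tool_use_without_gain`, the `min_acc_gain` threshold, and all numbers are assumptions chosen for illustration.

```python
def tool_use_without_gain(baseline, candidate, min_acc_gain=0.01):
    """Flag a candidate that calls tools more often than the baseline
    while improving accuracy by less than `min_acc_gain`.

    Both arguments are dicts with "accuracy" and "avg_tool_calls" keys.
    The threshold is an arbitrary illustrative default.
    """
    more_calls = candidate["avg_tool_calls"] > baseline["avg_tool_calls"]
    acc_gain = candidate["accuracy"] - baseline["accuracy"]
    return more_calls and acc_gain < min_acc_gain


# Made-up summary statistics, for illustration only.
before = {"accuracy": 0.60, "avg_tool_calls": 1.2}
inflated = {"accuracy": 0.60, "avg_tool_calls": 2.5}   # more calls, no gain
efficient = {"accuracy": 0.70, "avg_tool_calls": 1.3}  # slightly more calls, real gain

flag_inflated = tool_use_without_gain(before, inflated)
flag_efficient = tool_use_without_gain(before, efficient)
print(flag_inflated, flag_efficient)
```

A check like this only compares aggregate statistics; it says nothing about whether individual tool calls were necessary.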