Evaluation Setup
Tasks: Arithmetic reasoning (Math), Knowledge QA (LAMA), Temporal QA (TimeQA), and Multilingual QA. Each task is paired with a specific tool.
Benchmarks:
- GSM8K (Arithmetic Reasoning)
- SVAMP (Arithmetic Reasoning)
- LAMA (Knowledge Retrieval)
- TimeQA (Temporal Reasoning)
Metrics:
- Accuracy (share of predictions that match the gold answer)
- Tool Usage Rate (percentage of examples on which the tool was called)
- Statistical methodology: Not explicitly reported in the paper
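The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation script: the record fields (`prediction`, `gold`, `used_tool`), the `normalize` helper, and the choice of case-insensitive exact match are all assumptions made for the example, since the matching rule is not specified here.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(text.strip().lower().split())


def evaluate(records: list[dict]) -> dict[str, float]:
    """Return accuracy and tool usage rate, both as percentages."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    # A prediction counts as correct if it equals the gold answer after normalization.
    correct = sum(normalize(r["prediction"]) == normalize(r["gold"]) for r in records)
    # An example counts toward tool usage if the model called the tool at least once.
    tool_calls = sum(bool(r["used_tool"]) for r in records)
    return {
        "accuracy": 100.0 * correct / n,
        "tool_usage_rate": 100.0 * tool_calls / n,
    }


# Hypothetical records, for illustration only.
records = [
    {"prediction": "42", "gold": "42", "used_tool": True},
    {"prediction": "Paris", "gold": "paris", "used_tool": False},
    {"prediction": "7", "gold": "8", "used_tool": True},
    {"prediction": "1999", "gold": "1999", "used_tool": False},
]
scores = evaluate(records)
print(scores)  # {'accuracy': 75.0, 'tool_usage_rate': 50.0}
```

Reporting both numbers side by side makes it visible when an accuracy gain comes with more or fewer tool calls.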
Key Results
Trice outperforms baselines on arithmetic reasoning tasks using a Calculator tool.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| GSM8K | Accuracy | 39.27 | 42.08 | +2.81 |
| SVAMP | Accuracy | 56.40 | 60.43 | +4.03 |

Trice demonstrates superior performance on Knowledge QA using a Search tool (Atlas).

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| LAMA | Accuracy | 45.05 | 54.00 | +8.95 |

Ablation studies confirm the contribution of the RLEF stage (Stage II).

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| GSM8K | Accuracy | 39.27 | 42.08 | +2.81 |
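
The Δ values are the difference between the "This Paper" and "Baseline" accuracies. A short Python check, using only the numbers listed in the results above, recomputes them; the `results` dictionary and the `delta` helper are names introduced for this illustration.

```python
# (baseline accuracy, Trice accuracy) as listed in the results above.
results = {
    "GSM8K": (39.27, 42.08),
    "SVAMP": (56.40, 60.43),
    "LAMA": (45.05, 54.00),
}


def delta(baseline: float, ours: float) -> float:
    """Absolute accuracy difference, rounded to two decimals."""
    return round(ours - baseline, 2)


deltas = {name: delta(base, ours) for name, (base, ours) in results.items()}
for name, value in deltas.items():
    print(f"{name}: {value:+.2f}")
```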
Main Takeaways
- Selective tool use is superior to mandatory tool use: forcing the model to call a tool on every input (the 100% Tool baseline) often hurts performance compared to Trice.
- Execution feedback is effective: The RLEF stage consistently improves over simple behavior cloning, refining the decision boundary of when to use tools.
- Generalization: Trice works across different backbones (Alpaca, Vicuna, ChatGLM) and different task types (Math, QA, Translation).
- Analysis shows Trice reduces 'Over-Reliance' (using tools when not needed) and 'Insufficient Learning' (not using tools when needed) compared to baselines.
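
The two failure modes named in the last takeaway can be expressed as a simple tally over per-example decisions. The sketch below is an illustration under stated assumptions, not the paper's analysis code: it assumes each example carries a `tool_needed` label (for instance, whether the model answers incorrectly without the tool), and the function names `categorize` and `error_rates` are hypothetical.

```python
from collections import Counter

CATEGORIES = ("over_reliance", "insufficient_learning", "appropriate")


def categorize(used_tool: bool, tool_needed: bool) -> str:
    """Label one tool-use decision against an assumed 'tool_needed' flag."""
    if used_tool and not tool_needed:
        return "over_reliance"  # tool called although it was not needed
    if not used_tool and tool_needed:
        return "insufficient_learning"  # tool skipped although it was needed
    return "appropriate"  # decision agrees with the need


def error_rates(examples: list[tuple[bool, bool]]) -> dict[str, float]:
    """Percentage of examples in each category; input is (used_tool, tool_needed) pairs."""
    if not examples:
        raise ValueError("examples must not be empty")
    counts = Counter(categorize(used, needed) for used, needed in examples)
    return {c: 100.0 * counts[c] / len(examples) for c in CATEGORIES}


# Hypothetical decisions, for illustration only.
examples = [(True, False), (False, True), (True, True), (False, False)]
rates = error_rates(examples)
print(rates)
```

Under this framing, a model that always calls the tool can never show Insufficient Learning, but every example that did not need the tool counts as Over-Reliance.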