Evaluation Setup
Zero-shot evaluation on math problems using greedy decoding (unless otherwise specified)
Benchmarks:
- MATH (Competition-level mathematics)
- GSM8k (Grade school math word problems)
- GSM-Hard (Harder version of GSM8k)
- SVAMP (Math word problems with varying structures)
- TabMWP (Tabular math problems)
Metrics:
- Accuracy (Exact Match after rounding/parsing)
- Statistical methodology: Not explicitly reported in the paper
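To make "Exact Match after rounding/parsing" concrete, the following is a minimal sketch of one way such a metric could be computed. The helper names (`parse_number`, `is_correct`, `accuracy`), the choice to take the last number in the model output, and the two-decimal rounding are all assumptions made for illustration; the summary does not describe the paper's actual answer parser.

```python
import re


def parse_number(text):
    """Return the last number found in `text` as a float, or None if there is none."""
    cleaned = text.replace(",", "")  # assumption: commas are thousands separators
    matches = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return float(matches[-1]) if matches else None


def is_correct(prediction, reference, ndigits=2):
    """Exact match after parsing and rounding (ndigits=2 is an illustrative choice)."""
    pred = parse_number(prediction)
    ref = parse_number(reference)
    if pred is None or ref is None:
        # Fall back to a plain string comparison for non-numeric answers.
        return prediction.strip() == reference.strip()
    return round(pred, ndigits) == round(ref, ndigits)


def accuracy(predictions, references):
    """Percentage of predictions that match their reference answer."""
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```

For example, `is_correct("The answer is 1,234.50", "1234.5")` treats the two answers as equal, while a prediction containing no number is only accepted if it matches the reference string exactly.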
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance comparisons on the competition-level MATH dataset, showing ToRA's superiority over baselines. | | | | |
| MATH | Accuracy | 42.5 | 50.8 | +8.3 |
| MATH | Accuracy | 22.7 | 44.6 | +21.9 |
| Generalization capabilities on GSM8k and tabular tasks. | | | | |
| GSM8k | Accuracy | 80.4 | 84.3 | +3.9 |
| TabMWP | Accuracy | 49.8 | 74.0 | +24.2 |
| Ablation study on Output Space Shaping strategies (Sampling and Correction). | | | | |
| MATH | Accuracy | 46.0 | 50.8 | +4.8 |
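The Δ column above is the absolute difference between the two accuracy columns. The short check below recomputes it from the reported values; the row labels and the `delta` helper are hypothetical names introduced only for this sketch.

```python
# (label, baseline accuracy, this paper's accuracy), copied from the table above.
rows = [
    ("MATH, first comparison", 42.5, 50.8),
    ("MATH, second comparison", 22.7, 44.6),
    ("GSM8k", 80.4, 84.3),
    ("TabMWP", 49.8, 74.0),
    ("MATH, ablation", 46.0, 50.8),
]


def delta(baseline, ours):
    """Absolute accuracy difference, rounded to one decimal place."""
    return round(ours - baseline, 1)


deltas = [delta(baseline, ours) for _, baseline, ours in rows]

for (label, baseline, ours), d in zip(rows, deltas):
    print(f"{label}: {baseline} -> {ours} ({d:+.1f})")
```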
Main Takeaways
- Interleaving code and text (ToRA format) consistently outperforms Rationale-only (CoT) and Program-only (PAL) approaches across both LLaMA-2 and GPT-4 backbones.
- Output Space Shaping (adding corrected trajectories) yields substantial gains (up to 4.5 percentage points absolute) without requiring additional external data.
- ToRA-Code models (trained on CodeLLaMA) outperform ToRA models (trained on LLaMA-2) by ~5%, indicating the value of code-pretrained backbones for tool-use agents.
- Analysis shows different tool usage patterns per subtopic: Algebra relies heavily on SymPy solvers, while Number Theory relies on algorithmic loops (gcd, lcm).
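The interleaved format mentioned in the takeaways can be pictured as a trajectory that alternates natural-language rationale, a program, and the program's output. The sketch below is illustrative only: the trajectory is hand-written rather than generated by a model, the `run_program` and `render` helpers and the `output:` label are assumptions, and `exec` is used without any sandboxing. The program step uses `gcd`, in the spirit of the Number Theory pattern noted above.

```python
import contextlib
import io


def run_program(source):
    """Execute a program string and return what it printed (not a sandbox)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})  # fresh namespace for each program
    return buffer.getvalue().strip()


# Hypothetical trajectory: each step is (kind, content).
trajectory = [
    ("rationale", "The least common multiple of 12 and 18 follows from their gcd."),
    ("program", "from math import gcd\na, b = 12, 18\nprint(a * b // gcd(a, b))"),
    ("rationale", "The final answer is the printed value."),
]


def render(steps):
    """Interleave rationale, program, and execution output into one text."""
    parts = []
    for kind, content in steps:
        parts.append(content)
        if kind == "program":
            # The tool output is appended right after the program that produced it.
            parts.append("output: " + run_program(content))
    return "\n".join(parts)


print(render(trajectory))
```

In a real tool-integrated setup the execution output would be fed back to the model before it writes the next rationale; here the steps are fixed so the example stays self-contained.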