| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (points) |
|---|---|---|---|---|
| MAmmoTH models significantly outperform the leading open-source math model, WizardMath, on the challenging MATH dataset across different scales. | ||||
| MATH | Accuracy | 10.7 | 35.2 | +24.5 |
| MATH | Accuracy | 14.0 | 36.5 | +22.5 |
| MATH | Accuracy | 42.5 | 44.6 | +2.1 |
| Ablation studies confirm that combining CoT and PoT data yields better overall performance than either alone. | ||||
| Average (9 datasets) | Accuracy | 32.6 | 47.9 | +15.3 |
| Average (9 datasets) | Accuracy | 41.8 | 47.9 | +6.1 |
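
The Δ column appears to be the absolute difference between the "This Paper" and "Baseline" accuracies, in points. A minimal sketch that recomputes it from the rows above, assuming that definition; the `rows` list and the `delta` helper are illustrative names, not part of the original work:

```python
# Rows copied from the table: (benchmark, baseline, this_paper, reported_delta).
rows = [
    ("MATH", 10.7, 35.2, 24.5),
    ("MATH", 14.0, 36.5, 22.5),
    ("MATH", 42.5, 44.6, 2.1),
    ("Average (9 datasets)", 32.6, 47.9, 15.3),
    ("Average (9 datasets)", 41.8, 47.9, 6.1),
]


def delta(baseline, ours):
    """Absolute gain in accuracy points, rounded to one decimal like the table."""
    return round(ours - baseline, 1)


# Recompute each gain and collect any row whose reported delta disagrees.
computed = [delta(b, o) for _, b, o, _ in rows]
mismatches = [r for r, c in zip(rows, computed) if c != r[3]]

print(computed)
print(mismatches)
```

Rounding to one decimal avoids spurious mismatches from floating-point subtraction. An empty `mismatches` list means every reported Δ is consistent with its row.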