Comparative analysis on the Who&When benchmark demonstrates AgenTracer-8B's superiority over much larger models in both agent identification and step localization.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Who&When (handcrafted) | Agent-level Accuracy (w/ Ground Truth) | 56.90 | 69.10 | +12.20 |
| Who&When (handcrafted) | Step-level Accuracy (w/ Ground Truth) | 17.24 | 20.68 | +3.44 |
| Who&When (automated) | Step-level Accuracy (w/o Ground Truth) | 29.52 | 37.30 | +7.78 |

Evaluation on the internal TracerTraj test set across different domains (Code, Math, Agentic):

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TracerTraj-Agentic | Agent-level Accuracy (w/ Ground Truth) | 37.16 | 53.28 | +16.12 |
| MaAS + MATH-500 | Success Rate Improvement | Not reported | Not reported | +14.21 |