| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| NeuroPath significantly outperforms graph-based and iterative baselines on retrieval metrics across all three datasets. | ||||
| MuSiQue | Recall@2 | 41.8 | 48.0 | +6.2 |
| 2WikiMultiHopQA | Recall@2 | 62.5 | 77.2 | +14.7 |
| HotpotQA | Recall@2 | 65.3 | 75.6 | +10.3 |
| QA performance shows strong gains on the more complex datasets (MuSiQue, 2WikiMultiHopQA) but is only competitive with, or lower than, the baselines on HotpotQA due to shortcut effects. | ||||
| MuSiQue | F1 | 39.1 | 44.3 | +5.2 |
| 2WikiMultiHopQA | F1 | 55.3 | 73.2 | +17.9 |
| MuSiQue | Recall@5 | 45.3 | 62.7 | +17.4 |
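
The Δ column appears to be the absolute difference between the "This Paper" and "Baseline" scores. A minimal sketch that recomputes it from the rows above, assuming the scores are on the same scale and that Δ is a plain subtraction rounded to one decimal place (the names `ROWS`, `absolute_gain`, and `DELTAS` are illustrative, not from the paper):

```python
# Rows copied from the table: (benchmark, metric, baseline, this_paper).
ROWS = [
    ("MuSiQue", "Recall@2", 41.8, 48.0),
    ("2WikiMultiHopQA", "Recall@2", 62.5, 77.2),
    ("HotpotQA", "Recall@2", 65.3, 75.6),
    ("MuSiQue", "F1", 39.1, 44.3),
    ("2WikiMultiHopQA", "F1", 55.3, 73.2),
    ("MuSiQue", "Recall@5", 45.3, 62.7),
]


def absolute_gain(baseline: float, ours: float) -> float:
    """Assumed definition of the delta column: ours minus baseline, one decimal."""
    return round(ours - baseline, 1)


# Map (benchmark, metric) to the recomputed delta.
DELTAS = {
    (benchmark, metric): absolute_gain(baseline, ours)
    for benchmark, metric, baseline, ours in ROWS
}

if __name__ == "__main__":
    for (benchmark, metric), delta in DELTAS.items():
        print(f"{benchmark:16s} {metric:9s} {delta:+.1f}")
```

Under that assumption, the recomputed values can be compared directly against the Δ column to spot transcription errors.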