| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| StepGame (spatial reasoning): gains of +21.3 and +4.5 over the two baselines at k=10; gains are reported as largest at higher hop counts. | ||||
| StepGame (k=10) | Accuracy | 48.3 | 69.6 | +21.3 |
| StepGame (k=10) | Accuracy | 65.1 | 69.6 | +4.5 |
| CLUTRR (kinship reasoning): a +14.1 gain over the baseline, presented as evidence of robustness. | ||||
| CLUTRR | Accuracy | 64.3 | 78.4 | +14.1 |
| Chinese Kinship (newly constructed dataset, described as highly complex): a +13.6 gain over the baseline. | ||||
| Chinese Kinship | Accuracy | 70.0 | 83.6 | +13.6 |
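
The Δ column appears to be the absolute difference between the "This Paper" and "Baseline" columns. As a quick sanity check, the sketch below recomputes each Δ from the values in the table. The names `ROWS`, `delta`, and `check_rows` are illustrative and not taken from the paper, and the assumption that Δ is a plain subtraction (not a relative improvement) is inferred from the numbers themselves.

```python
from decimal import Decimal

# (benchmark, baseline, this_paper, reported_delta), copied from the table.
# Values are kept as strings so Decimal can parse them exactly.
ROWS = [
    ("StepGame (k=10)", "48.3", "69.6", "+21.3"),
    ("StepGame (k=10)", "65.1", "69.6", "+4.5"),
    ("CLUTRR", "64.3", "78.4", "+14.1"),
    ("Chinese Kinship", "70.0", "83.6", "+13.6"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Absolute gain: 'This Paper' minus 'Baseline', computed exactly."""
    return Decimal(ours) - Decimal(baseline)


def check_rows(rows):
    """Return the rows whose reported delta differs from the recomputed one."""
    return [r for r in rows if delta(r[1], r[2]) != Decimal(r[3])]


if __name__ == "__main__":
    mismatches = check_rows(ROWS)
    print("mismatching rows:", mismatches)
```

`Decimal` is used instead of `float` so that one-decimal values such as 69.6 and 48.3 subtract exactly, without binary rounding noise affecting the comparison.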