| Benchmark | Metric | Baseline | PoK (This Paper) | Δ (points) |
|---|---|---|---|---|
| Main comparison on the MultiTQ dataset: PoK outperforms both traditional TKGQA methods and recent LLM-based approaches, by 1.4 to 40.0 Hits@1 points. | ||||
| MultiTQ | Hits@1 | 76.5 | 77.9 | +1.4 |
| MultiTQ | Hits@1 | 70.2 | 77.9 | +7.7 |
| MultiTQ | Hits@1 | 37.9 | 77.9 | +40.0 |
| Results on the TimeQuestions dataset: PoK outperforms graph-based and generative baselines by 22.7 to 24.8 Hits@1 points. | ||||
| TimeQuestions | Hits@1 | 58.4 | 83.2 | +24.8 |
| TimeQuestions | Hits@1 | 60.5 | 83.2 | +22.7 |
| Ablation studies on MultiTQ: each row compares an ablated variant (Baseline column) with the full PoK framework; the ablations lower Hits@1 by 6.6 to 45.9 points. | ||||
| MultiTQ | Hits@1 | 71.3 | 77.9 | +6.6 |
| MultiTQ | Hits@1 | 32.0 | 77.9 | +45.9 |
| Performance on complex questions from the Timeline datasets: PoK improves over GPT-4o by 30.7 Hits@1 points on Timeline-ICEWS. | ||||
| Timeline-ICEWS | Hits@1 (Complex) | 37.6 | 68.3 | +30.7 |
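
For readers who want to check the table's arithmetic, the following is a minimal sketch of how a Hits@1 score and the Δ column can be computed. It assumes Hits@1 is reported as a percentage and that a question counts as a hit when its top-ranked prediction appears in the gold answer set. The function names `hits_at_1` and `delta` and the toy inputs are hypothetical, not taken from the paper's code.

```python
from typing import Sequence, Set


def hits_at_1(ranked_predictions: Sequence[Sequence[str]],
              gold_answers: Sequence[Set[str]]) -> float:
    """Return the percentage of questions whose top-ranked prediction is correct.

    Assumption: a prediction is correct if it is a member of the gold answer set.
    """
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must have the same length")
    if not ranked_predictions:
        raise ValueError("at least one question is required")
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if preds and preds[0] in gold  # only the first (top-ranked) answer counts
    )
    return 100.0 * hits / len(ranked_predictions)


def delta(baseline: float, ours: float) -> float:
    """Absolute difference in points, rounded to one decimal like the table."""
    return round(ours - baseline, 1)


# Toy example (hypothetical data): one of two questions is answered correctly.
toy_score = hits_at_1([["2014", "2015"], ["Obama"]], [{"2014"}, {"Merkel"}])

# Recomputing two Δ entries from the table's own Baseline and PoK values.
multitq_delta = delta(76.5, 77.9)
timeline_delta = delta(37.6, 68.3)
```

Here Δ is an absolute difference in metric points, not a relative improvement, which is why the 37.9 to 77.9 row reads +40.0 rather than a percentage gain.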