Evaluation Setup
Knowledge base question answering (KBQA) over heterogeneous knowledge bases (Freebase, Wikidata, Movie KB)
Benchmarks:
- WebQuestionsSP (WebQSP): KBQA on Freebase (1-hop, 2-hop)
- ComplexWebQuestions (CWQ): complex KBQA on Freebase (conjunction, composition, comparative, superlative)
- KQA Pro: large-scale complex KBQA on Wikidata (9 question types)
- MetaQA: multi-hop KBQA on Movie KB
Metrics:
- F1 score
- Exact Match (EM)
- Random Hits@1
- Statistical methodology: Not explicitly reported in the paper
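
The answer-set metrics listed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation script: it assumes predictions and gold answers are sets of strings, and it assumes Random Hits@1 means drawing one predicted answer uniformly at random and checking it against the gold set, so the function below returns the expected value of that draw. All function names are hypothetical.

```python
def exact_match(pred, gold):
    """1.0 if the predicted answer set equals the gold set, else 0.0."""
    return float(set(pred) == set(gold))


def f1_score(pred, gold):
    """Set-level F1 between predicted and gold answers."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # assumption: two empty sets count as a perfect match
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def expected_random_hits_at_1(pred, gold):
    """Probability that one uniformly drawn prediction is a gold answer.

    Assumed reading of Random Hits@1; the actual metric samples a single
    answer, and this is the expectation of that sample.
    """
    pred, gold = set(pred), set(gold)
    if not pred:
        return 0.0
    return len(pred & gold) / len(pred)


if __name__ == "__main__":
    pred, gold = {"a", "b"}, {"b", "c"}
    print(exact_match(pred, gold))                # 0.0
    print(f1_score(pred, gold))                   # 0.5
    print(expected_random_hits_at_1(pred, gold))  # 0.5
```

Per-question scores would then be averaged over the test set to obtain the benchmark-level numbers.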
Key Results
Comparative performance on standard benchmarks. Note that for WebQSP and KQA Pro, the proposed method uses significantly less training data (~50/type) compared to baselines using full datasets (~3K-33K), yet remains competitive.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ComplexWebQuestions (CWQ) | F1 | 61.1 | 69.1 | +8.0 |
| MetaQA (3-hop) | F1 | 87.0 | 90.7 | +3.7 |
| WebQuestionsSP (WebQSP) | F1 | 75.7 | 73.9 | -1.8 |
| KQA Pro | Accuracy | 90.55 | 75.35 | -15.2 |

Detailed breakdown by question type on CWQ showing specific strengths in reasoning-heavy categories.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ComplexWebQuestions (Comparative) | F1 | 39.60 | 69.45 | +29.85 |
| ComplexWebQuestions (Superlative) | F1 | 54.12 | 68.08 | +13.96 |

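
The Δ column is the reported score of this paper minus the baseline score. A short sketch that recomputes it from the values listed in the results above (the `delta` helper and the `rows` layout are illustrative, not from the paper):

```python
# (benchmark, metric, baseline, this paper), copied from the reported results
rows = [
    ("CWQ", "F1", 61.1, 69.1),
    ("MetaQA (3-hop)", "F1", 87.0, 90.7),
    ("WebQSP", "F1", 75.7, 73.9),
    ("KQA Pro", "Accuracy", 90.55, 75.35),
    ("CWQ (Comparative)", "F1", 39.60, 69.45),
    ("CWQ (Superlative)", "F1", 54.12, 68.08),
]


def delta(baseline, ours):
    """Signed difference in points, rounded to hide float noise."""
    return round(ours - baseline, 2)


if __name__ == "__main__":
    for name, metric, baseline, ours in rows:
        print(f"{name} ({metric}): {delta(baseline, ours):+.2f}")
```
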
Main Takeaways
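- Interactive-KBQA indicates that agentic interaction with tools can substantially reduce the need for massive labeled datasets in semantic parsing: with ~50 examples per question type it beats the baseline on CWQ and MetaQA and stays close on WebQSP (-1.8 F1), although it trails the full-data baseline on KQA Pro (-15.2 accuracy).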
- Fine-tuning open-source models (Mistral-7B) on a small set of high-quality, human-corrected interaction traces can outperform closed-source models (GPT-4) and full-data baselines on complex query types.
- The method excels at reasoning-heavy questions (comparatives, superlatives), where traditional one-shot semantic parsing struggles to capture the logic; on CWQ the gains are +29.85 F1 (comparative) and +13.96 F1 (superlative).
- The framework unifies interaction logic across different KB structures (Freebase, Wikidata), which suggests robustness across schemas.