University of Oslo,
Bosch Center for AI,
University of Stuttgart,
University of Oxford,
The University of Manchester,
University of Southampton
arXiv, 4/2025
(2025)
RAG, KGQA, Benchmark
📝 Paper Summary
Graph-based RAG pipeline
This paper systematically evaluates three KG-RAG methods (RoG, ToG, G-Retriever) by simulating KG incompleteness via random deletion and reasoning path disruption, finding that while they outperform retrieval-free baselines, they are highly sensitive to missing direct evidence.
Core Problem
Existing KG-RAG evaluations typically assume complete Knowledge Graphs (KGs) where answers are directly inferable, failing to reflect real-world scenarios where KGs are often incomplete.
Why it matters:
KGs in practice are often sparse and missing critical links.
Current benchmarks do not test whether KG-RAG methods can reason over incomplete data or whether they break down when direct evidence is missing.
Key Novelty
Systematic Robustness Evaluation of KG-RAG under Incompleteness
Introduces specific deletion strategies: 'Random Triple Deletion' (general sparsity) and 'Reasoning Path Disruption' (targeting critical reasoning chains).
Assesses whether LLMs can compensate for missing KG links using their internal parametric knowledge or alternative paths.
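As a rough illustration of the 'Random Triple Deletion' strategy, the sketch below removes a fixed fraction of triples uniformly at random. The function name `random_triple_deletion`, the `seed` parameter, and the toy graph are assumptions made for this example; the paper's actual sampling procedure may differ.

```python
import random

def random_triple_deletion(triples, ratio, seed=0):
    """Return a copy of `triples` with about `ratio` of them removed uniformly at random.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the perturbed KG is reproducible
    n_delete = round(len(triples) * ratio)
    doomed = set(rng.sample(range(len(triples)), n_delete))
    return [t for i, t in enumerate(triples) if i not in doomed]

# Toy KG of (head, relation, tail) triples; entity names are placeholders.
toy_kg = [(f"e{i}", "related_to", f"e{i + 1}") for i in range(10)]
sparse_kg = random_triple_deletion(toy_kg, ratio=0.2)
```

Deleting by index rather than by value keeps duplicate triples distinct, and the fixed seed lets every KG-RAG method be run on the same perturbed graph.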
Architecture
An illustration of the 'Reasoning Path Disruption' concept using Justin Bieber's brother example, showing how removing a direct link forces (or fails to force) the model to use an alternative multi-hop path.
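The idea behind the figure can be sketched on a three-triple toy graph: find the shortest reasoning path, delete it, and check whether a longer alternative path still reaches the answer. The helpers `find_path` and `disrupt_path`, the relation names, and the entities other than Justin Bieber are illustrative assumptions, not the figure's exact contents or the paper's code.

```python
from collections import deque

def find_path(triples, start, goal, max_hops=3):
    """Breadth-first search: shortest list of triples linking start to goal, or None."""
    adjacency = {}
    for head, rel, tail in triples:
        adjacency.setdefault(head, []).append((rel, tail))
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_hops:
            continue  # do not expand beyond the hop limit
        for rel, nxt in adjacency.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None

def disrupt_path(triples, path):
    """Remove every triple that lies on the given reasoning path."""
    doomed = set(path)
    return [t for t in triples if t not in doomed]

# Toy KG (assumed for illustration): one direct link plus a two-hop alternative.
kg = [
    ("Justin Bieber", "sibling", "Jaxon Bieber"),
    ("Justin Bieber", "parent", "Jeremy Bieber"),
    ("Jeremy Bieber", "child", "Jaxon Bieber"),
]
direct = find_path(kg, "Justin Bieber", "Jaxon Bieber")
disrupted_kg = disrupt_path(kg, direct)
fallback = find_path(disrupted_kg, "Justin Bieber", "Jaxon Bieber")
```

Here `direct` is the one-hop sibling link and `fallback` is the two-hop route through the parent. In this toy setting the answer remains reachable after disruption, which is the situation where a KG-RAG method could in principle recover but, per the paper's findings, often does not.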
Evaluation Highlights
RoG on WebQSP: Accuracy drops from 76.75% (no deletion) to 72.15% (20% random deletion) and sharply to 65.43% under reasoning path disruption (-11.32 points, about a 14.7% relative drop).
ToG on WebQSP: Accuracy drops from 44.19% to 41.61% (20% random deletion) and to 38.25% under reasoning path disruption.
G-Retriever on WebQSP: Accuracy drops from 53.43% to 50.88% (20% random deletion) and to 49.36% under reasoning path disruption.
Despite these drops, all methods still outperform 'No Retrieval' baselines (e.g., RoG without retrieval scores 50.46%).
Models often fail to switch to alternative valid reasoning paths when the primary/shortest path is broken.
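The drops above can be read either as absolute percentage points or as relative change. The sketch below computes both from the WebQSP accuracies reported in these highlights; the `drops` helper and the dictionary layout are assumptions for illustration only.

```python
def drops(before, after):
    """Return (absolute change in points, relative change in percent), rounded to 2 decimals."""
    absolute = round(after - before, 2)
    relative = round(100 * (after - before) / before, 2)
    return absolute, relative

# WebQSP accuracies (%) as reported in the highlights above.
webqsp = {
    "RoG": {"none": 76.75, "random_20": 72.15, "path": 65.43},
    "ToG": {"none": 44.19, "random_20": 41.61, "path": 38.25},
    "G-Retriever": {"none": 53.43, "random_20": 50.88, "path": 49.36},
}

summary = {
    method: {
        setting: drops(acc["none"], acc[setting])
        for setting in ("random_20", "path")
    }
    for method, acc in webqsp.items()
}

for method, result in summary.items():
    print(method, result)
```

For every method, the drop under reasoning path disruption is larger than under 20% random deletion, which matches the paper's point that targeted removal of evidence hurts more than general sparsity.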
Breakthrough Assessment
3/10
It is a solid evaluation paper that highlights a weakness in current methods, but it does not propose a new method to solve the problem, serving primarily as a diagnostic study.
⚙️ Technical Details
Pipeline Flow
Select KG-RAG method (RoG, ToG, or G-Retriever).
Apply deletion strategy to the Knowledge Graph (Random Deletion or Path Disruption).
Run the KG-RAG method on the modified KG.
Evaluate Accuracy and Hits@1 against ground truth answers.
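The four steps above can be sketched as a small evaluation loop. Everything here is a stand-in: `lookup_method` is a trivial placeholder for a real KG-RAG system such as RoG, ToG, or G-Retriever, `run_experiment` is a hypothetical driver, and Hits@1 is taken in its usual sense (the top-ranked prediction appears in the gold answer set), which is assumed rather than confirmed to match the paper's exact scoring.

```python
def hits_at_1(predictions, gold):
    """1.0 if the top-ranked prediction is a gold answer, else 0.0."""
    return 1.0 if predictions and predictions[0] in gold else 0.0

def run_experiment(method, perturb, kg, questions):
    """Perturb the KG, run the method on each question, and return mean Hits@1 in percent."""
    perturbed_kg = perturb(kg)
    scores = [hits_at_1(method(q, perturbed_kg), gold) for q, gold in questions]
    return 100 * sum(scores) / len(scores)

def lookup_method(question, kg):
    """Placeholder 'KG-RAG' method: answer (head, relation) queries by direct lookup."""
    head, rel = question
    return [t for h, r, t in kg if h == head and r == rel]

# Toy data (assumed): two one-hop questions with their gold answer sets.
kg = [("a", "r", "b"), ("c", "r", "d")]
questions = [(("a", "r"), {"b"}), (("c", "r"), {"d"})]

full_score = run_experiment(lookup_method, lambda g: list(g), kg, questions)
damaged_score = run_experiment(lookup_method, lambda g: g[:1], kg, questions)
```

Passing the perturbation as a function keeps the method and the deletion strategy independent, so the same loop covers the no-deletion baseline, random deletion, and path disruption.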
System Modules
KG Deletion Module
Simulate incompleteness.
Model or implementation: N/A (Algorithmic)
KG-RAG Model
Retrieve and Generate Answer.
Model or implementation: Varied (RoG, ToG, G-Retriever)
📊 Experiments & Results
Evaluation Setup
Perturbation analysis on standard KGQA benchmarks.
Benchmarks:
WebQuestionsSP (WebQSP) (KGQA)
Complex WebQuestions (CWQ) (KGQA)
Metrics:
Accuracy
Hits@1
Key Results
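| Benchmark | Setting | Metric | Baseline (%) | Result (%) | Δ (points) |
|---|---|---|---|---|---|
| WebQSP | RoG, reasoning path disruption | Accuracy | 76.75 | 65.43 | -11.32 |
| WebQSP | RoG, 20% random deletion | Accuracy | 76.75 | 72.15 | -4.60 |
| CWQ | (setting not specified) | Accuracy | 57.49 | 52.95 | -4.54 |
| WebQSP | RoG, no retrieval | Accuracy | 76.75 | 50.46 | -26.29 |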
Main Takeaways
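KG-RAG methods are sensitive to incomplete knowledge; disrupting specific reasoning paths causes larger drops than random deletion (e.g., RoG on WebQSP loses 11.32 points under path disruption versus 4.60 under 20% random deletion).
Even with 20% of the KG deleted, KG-RAG still outperforms retrieval-free LLMs, indicating that incomplete KGs retain residual value.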
Current models struggle to adapt by finding alternative reasoning paths when the most obvious one is missing.