KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering

📝 Paper Summary

Modularized RAG pipeline, Graph-based RAG pipeline

KERAG improves Knowledge Graph Question Answering by retrieving broader entity-level neighborhoods instead of rigid paths, then using a fine-tuned Chain-of-Thought LLM to filter and summarize the answer.

Core Problem

Traditional Semantic Parsing (SP) for KGQA retrieves only strictly necessary triples, leading to low coverage (high miss rates) due to rigid schemas and parsing errors.

Why it matters:

Rigid SP-based methods fail when natural language questions are ambiguous or do not perfectly match the KG schema.
Existing LLM-based KGQA methods that generate retrieval paths still suffer from low recall (missing answers) because they only explore a few specific paths rather than the broader context.
Head entities in KGs have massive neighborhoods (up to 2M triples), creating noise that overwhelms standard RAG summarizers.

Concrete Example: For the query 'Which books written by J.K. Rowling are related to magic?', a standard SP approach generates a SPARQL query looking for a specific `:topic :Magic` triple. If the KG records 'magic' in the `:description` attribute instead of `:topic`, the query returns empty results. KERAG retrieves the entire 'J.K. Rowling' neighborhood and uses an LLM to find 'magic' within the descriptions.
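
A minimal sketch of the contrast in this example, assuming a toy in-memory KG of (subject, predicate, object) tuples. The book entries, predicate names, and helper functions are illustrative and not taken from the paper, and a keyword match stands in for the LLM filter.

```python
# Toy KG; all triples are illustrative placeholders.
KG = [
    ("J.K. Rowling", ":author_of", "Book A"),
    ("J.K. Rowling", ":author_of", "Book B"),
    ("Book A", ":description", "A young wizard learns magic at school."),
    ("Book B", ":description", "A detective investigates a disappearance."),
    ("Book A", ":topic", "Fantasy"),
]

def rigid_lookup(author, topic):
    """SP-style lookup: require an exact (:topic, topic) triple on each book."""
    books = [o for s, p, o in KG if s == author and p == ":author_of"]
    return [b for b in books if (b, ":topic", topic) in KG]

def neighborhood(entity, hops=2):
    """Entity-level retrieval: collect every triple within `hops` of the entity."""
    frontier, seen, triples = {entity}, {entity}, []
    for _ in range(hops):
        next_frontier = set()
        for s, p, o in KG:
            if s in frontier:
                triples.append((s, p, o))
                if o not in seen:
                    seen.add(o)
                    next_frontier.add(o)
        frontier = next_frontier
    return triples

def filter_by_keyword(triples, keyword):
    """Stand-in for the LLM filter: keep subjects with any attribute mentioning the keyword."""
    return sorted({s for s, p, o in triples
                   if p != ":author_of" and keyword.lower() in o.lower()})

rigid_hits = rigid_lookup("J.K. Rowling", "Magic")                           # no :topic :Magic triple exists
broad_hits = filter_by_keyword(neighborhood("J.K. Rowling"), "magic")        # found via :description
```

The rigid lookup returns nothing because the schema guess is wrong, while the neighborhood pass still surfaces the book whose description mentions magic.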

Key Novelty

Retrieval-Filter-Summarization over Entity Neighborhoods

Shifts from triple-level retrieval (finding exact paths) to entity-level retrieval (gathering broad subgraphs around topic entities) to maximize recall.
Interleaves multi-hop retrieval with schema-based filtering during planning to manage the volume of data without overwhelming the context window.
Uses a fine-tuned Chain-of-Thought (CoT) summarizer trained on synthetic data (generated by validating LLM reasoning against ground truth) to handle complex aggregation and reasoning.

Architecture

The overall pipeline of KERAG involving Planning, Retrieval, and Summarization.

Evaluation Highlights

Outperforms state-of-the-art KGQA methods (WikiSP, StructGPT, ToG) by roughly 7 to 9 percentage points in truthfulness on the Head2Tail benchmark (0.860 vs. 0.770 to 0.790 in the reported comparisons).
Surpasses GPT-4o (Tool-use) by 21.4 percentage points in truthfulness on the CRAG benchmark (0.529 vs 0.315), primarily by reducing the miss rate from 59% to 6.6%.
Achieves 90.8% accuracy on Head2Tail, effectively solving simple KGQA questions while maintaining robustness across head, torso, and tail entities.

Breakthrough Assessment

8/10

Significant improvement in KGQA recall/coverage by abandoning rigid semantic parsing for broad neighborhood retrieval. The fine-tuning strategy for CoT summarization is practical and effective.

⚙️ Technical Details

Problem Definition

Setting: Knowledge Graph Question Answering (KGQA) using Retrieval-Augmented Generation

Inputs: Natural language question Q, Knowledge Graph K (access via SPARQL or API)

Outputs: Answer A

Pipeline Flow

Planning (Identify topic entity & domain → Schema exploration)
Retrieval (Iterative expansion: Get neighbors → Filter relations → Check termination)
Summarization (CoT reasoning over retrieved subgraph → Final Answer)
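
The three stages above can be sketched as a single loop. This is a rough outline under stated assumptions, not the authors' implementation: `run_pipeline` and its callable parameters are hypothetical names, and simple lambdas stand in for the LLM planner, the KG interface, and the CoT summarizer.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

def run_pipeline(
    question: str,
    topic_entity: str,
    get_neighbors: Callable[[str], List[Triple]],
    keep_relation: Callable[[str, str], bool],
    should_stop: Callable[[str, List[Triple]], bool],
    summarize: Callable[[str, List[Triple]], str],
    max_hops: int = 3,
) -> str:
    subgraph: List[Triple] = []
    frontier = [topic_entity]
    visited = {topic_entity}
    for _ in range(max_hops):
        next_frontier = []
        for entity in frontier:
            for s, p, o in get_neighbors(entity):       # Get neighbors
                if not keep_relation(question, p):      # Filter relations
                    continue
                subgraph.append((s, p, o))
                if o not in visited:
                    visited.add(o)
                    next_frontier.append(o)
        # Check termination: planner is satisfied, or nothing left to expand
        if should_stop(question, subgraph) or not next_frontier:
            break
        frontier = next_frontier
    return summarize(question, subgraph)

# Toy usage with placeholder entities and hand-written stand-ins for the LLM calls.
toy_kg = {
    "Film X": [("Film X", "director", "Person Y"), ("Film X", "budget", "10M")],
    "Person Y": [("Person Y", "birthplace", "City Z"), ("Person Y", "spouse", "Person W")],
}
answer = run_pipeline(
    "Where was the director of Film X born?",
    "Film X",
    get_neighbors=lambda e: toy_kg.get(e, []),
    keep_relation=lambda q, p: p in {"director", "birthplace"},
    should_stop=lambda q, g: any(p == "birthplace" for _, p, _ in g),
    summarize=lambda q, g: next(o for _, p, o in g if p == "birthplace"),
)
```

Passing the stages in as callables keeps the control flow visible and lets each stand-in be swapped for a real model or endpoint.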

System Modules

Planner (Planning & Retrieval)

Identifies the topic entity/domain and iteratively decides which relations to fetch or filter based on schema

Model or implementation: Llama-3.1-70B-Instruct (or similar LLM)

Retriever (Planning & Retrieval)

Executes the plan to fetch actual KG data (via API or SPARQL)

Model or implementation: Programmatic interface (SPARQL endpoint or API)

Summarizer

Reasons over the filtered subgraph to generate the answer

Model or implementation: Llama-3.1-70B-Instruct (Fine-tuned for CoT)

Novel Architectural Elements

Iterative retrieval-filtering loop at the schema level: decisions to expand hops or filter relations are made before full data retrieval to prevent context overflow
Entity-centric retrieval scope: retrieves the whole relevant neighborhood rather than traversing a specific semantic path
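
One way to picture the schema-level decision is a planner that sees only relation names and triple counts, then picks what to fetch within a context budget. The sketch below is an assumption-laden illustration: `plan_relations`, the count-based budget, and the cheapest-first ordering are hypothetical choices, and the `relevant` set stands in for the LLM planner's judgment.

```python
def plan_relations(schema_counts, relevant, budget):
    """Choose relations to fetch using schema metadata only (relation -> triple count).

    No triples are retrieved here; the point is to decide before fetching data.
    """
    selected, used = [], 0
    # Cheapest-first ordering is an illustrative policy, not the paper's.
    for rel in sorted(schema_counts, key=schema_counts.get):
        if rel not in relevant:
            continue                      # filtered out at the schema level
        if used + schema_counts[rel] > budget:
            continue                      # would overflow the context budget
        selected.append(rel)
        used += schema_counts[rel]
    return selected, used

# A head entity with one very large relation (counts are made up).
schema_counts = {"authored": 40, "cited_by": 2_000_000, "award": 3, "birthplace": 1}
selected, used = plan_relations(
    schema_counts,
    relevant={"authored", "award", "cited_by"},
    budget=500,
)
```

Here the two-million-triple relation is skipped before any data is fetched, which is the effect the schema-level loop is meant to achieve.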

Modeling

Base Model: Llama-3.1-70B-Instruct (also tested with Llama-3.1-8B-Instruct and DeepSeek-V3)

Training Method: Supervised Fine-Tuning (SFT) on synthetic CoT data

Adaptation: Full fine-tuning (implied by context of SFT description)

Training Data:

1. Prompt LLM to generate CoT reasoning + Answer for training queries.
2. Validate generated answer against ground truth.
3. If Correct: Use (Question, CoT + Answer) as training data.
4. If Wrong: Use (Question, Standard Prompt + Ground Truth) as training data.

Datasets: CRAG validation set used as training data; Head2Tail splits used.
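
The four-step data-generation recipe above can be sketched as follows. The prompt templates, field names, and exact-match validator are assumptions made for illustration; the paper's actual prompts and matching rules are not reproduced here.

```python
COT_PROMPT = "Question: {q}\nThink step by step, then answer."   # hypothetical template
STANDARD_PROMPT = "Question: {q}\nAnswer directly."               # hypothetical template

def exact_match(predicted, truth):
    """Simplistic validator; a real setup would likely need fuzzier matching."""
    return predicted.strip().lower() == truth.strip().lower()

def build_sft_examples(queries, generate_cot, is_correct=exact_match):
    examples = []
    for item in queries:
        question, truth = item["question"], item["ground_truth"]
        reasoning, answer = generate_cot(question)          # step 1
        if is_correct(answer, truth):                       # step 2
            examples.append({                               # step 3: keep CoT + answer
                "prompt": COT_PROMPT.format(q=question),
                "target": f"{reasoning}\nAnswer: {answer}",
                "kind": "cot",
            })
        else:
            examples.append({                               # step 4: standard prompt + ground truth
                "prompt": STANDARD_PROMPT.format(q=question),
                "target": truth,
                "kind": "direct",
            })
    return examples

# Toy usage with a fake generator that gets the second question wrong.
fake_outputs = {
    "What is 2 + 2?": ("2 plus 2 equals 4.", "4"),
    "What is the capital of France?": ("I recall a large French city.", "Lyon"),
}
queries = [
    {"question": "What is 2 + 2?", "ground_truth": "4"},
    {"question": "What is the capital of France?", "ground_truth": "Paris"},
]
examples = build_sft_examples(queries, generate_cot=lambda q: fake_outputs[q])
```

Only validated reasoning chains become CoT targets, so incorrect reasoning does not end up in the fine-tuning set.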

Compute: Inference latency: 8.55s (CRAG), 3.18s (Head2Tail)

Comparison to Prior Work

vs. ToG: KERAG retrieves whole entity neighborhoods (recall 0.952) vs. ToG's path-based search (recall 0.844), improving coverage.
vs. StructGPT: KERAG uses a retrieval-filter-summarize paradigm rather than iterative interface navigation, resulting in lower latency and higher accuracy.
vs. Semantic Parsing (WikiSP): KERAG avoids rigid logical form generation, replacing it with LLM-based filtering of broader subgraphs.
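
The recall figures in the ToG comparison describe how often the retrieved context contains the answer. One common way to measure this is sketched below; the paper's exact definition may differ, and the function name and toy data are hypothetical.

```python
def retrieval_recall(retrieved, gold_answers):
    """Fraction of questions whose retrieved triples mention at least one gold answer."""
    hits = 0
    for qid, answers in gold_answers.items():
        mentioned = {term for s, _, o in retrieved.get(qid, []) for term in (s, o)}
        if mentioned & set(answers):
            hits += 1
    return hits / len(gold_answers)

gold = {"q1": ["City Z"], "q2": ["1999"], "q3": ["Person W"]}

# Path-based retrieval: one guessed path per question (q2 follows the wrong relation).
path_based = {
    "q1": [("Person Y", "birthplace", "City Z")],
    "q2": [("Film X", "director", "Person Y")],
    "q3": [("Person Y", "spouse", "Person W")],
}

# Neighborhood retrieval: all triples around the topic entity.
neighborhood_based = {
    "q1": [("Person Y", "birthplace", "City Z"), ("Person Y", "spouse", "Person W")],
    "q2": [("Film X", "director", "Person Y"), ("Film X", "release_year", "1999")],
    "q3": [("Person Y", "birthplace", "City Z"), ("Person Y", "spouse", "Person W")],
}

path_recall = retrieval_recall(path_based, gold)
neighborhood_recall = retrieval_recall(neighborhood_based, gold)
```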

Limitations

Evaluation limited to 6 specific datasets (CRAG, Head2Tail, QALD-10, WebQSP, AdvHotpotQA, CWQ).
Risk of error propagation in the multi-stage pipeline (e.g., entity linking errors).
Performance depends on the quality of the underlying Entity Linking (currently uses a standard entity linker).
Latency is higher than pure semantic parsing approaches due to LLM usage in planning and summarization.

Reproducibility

Code: https://github.com/ysunbp/KERAG

Code and data are publicly available at the repository above. The paper uses public benchmarks (CRAG, Head2Tail, QALD-10, etc.). Validated on Llama-3.1 and GPT-4o.

📊 Experiments & Results

Evaluation Setup

Open-domain KGQA using both API-based (CRAG) and SPARQL-based (Head2Tail, others) access.

Benchmarks:

CRAG (KDD Cup 2024) (API-based KGQA)
Head2Tail (SPARQL-based KGQA over DBpedia)
QALD-10-en (Complex KGQA)
WebQSP (KGQA)
CWQ (Complex Web Questions)
AdvHotpotQA (Adversarial Multi-hop QA)

Metrics:

Accuracy (A)
Hallucination Rate (H)
Miss Rate (M)
Truthfulness (T = A - H)

Statistical methodology: Not explicitly reported in the paper
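
The four metrics listed above can be computed from per-question labels, as in this small sketch. It assumes each question is labeled exactly one of correct, hallucinated, or missing, so that A + H + M = 1; the label strings and function name are illustrative.

```python
from collections import Counter

def kgqa_metrics(labels):
    """labels: one of 'correct', 'hallucinated', 'missing' per question."""
    counts = Counter(labels)
    n = sum(counts.values())
    accuracy = counts["correct"] / n
    hallucination = counts["hallucinated"] / n
    miss = counts["missing"] / n
    return {"A": accuracy, "H": hallucination, "M": miss, "T": accuracy - hallucination}

# Toy run: 6 correct, 1 hallucinated, 3 missing out of 10 questions.
m = kgqa_metrics(["correct"] * 6 + ["hallucinated"] * 1 + ["missing"] * 3)

# Worked arithmetic for Head2Tail, assuming the reported A = 0.908 and T = 0.860
# come from the same run: H = A - T and M = 1 - A - H.
implied_h = round(0.908 - 0.860, 3)
implied_m = round(1 - 0.908 - implied_h, 3)
```

Because T subtracts H but not M, a system that abstains scores zero on that question, while a system that answers wrongly is penalized.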

Key Results

Performance on CRAG (API-based) showing KERAG surpassing both LLM baselines and competition winners. The 0.315 baseline is GPT-4o (Tool-use).

Benchmark	Metric	Baseline	This Paper	Δ
CRAG	Truthfulness	0.315	0.529	+0.214
CRAG	Truthfulness	0.458	0.529	+0.071

Performance on Head2Tail (SPARQL-based) comparing against state-of-the-art KGQA specialized models.

Benchmark	Metric	Baseline	This Paper	Δ
Head2Tail	Truthfulness	0.790	0.860	+0.070
Head2Tail	Truthfulness	0.770	0.860	+0.090

Ablation studies on CRAG demonstrating the contribution of each component. In these rows the first value is the full KERAG system (0.529) and the second is the ablated variant.

Benchmark	Metric	Full KERAG	Ablated	Δ
CRAG	Truthfulness	0.529	0.474	-0.055
CRAG	Truthfulness	0.529	0.453	-0.076
CRAG	Truthfulness	0.529	0.301	-0.228

Experiment Figures

Comparison of traditional SP-based approach vs. KERAG rationale on a 'magic' book query.

The data generation process for fine-tuning the CoT summarizer.

Main Takeaways

Broad entity-level neighborhood retrieval significantly improves recall compared to path-based or tool-use methods, reducing miss rates dramatically (e.g., from 59% to 6.6% on CRAG).
Fine-tuned Chain-of-Thought (CoT) summarization is the most critical component, preventing the model from hallucinating or refusing to answer when faced with large amounts of retrieved data.
The method is robust across entity popularity levels (head, torso, tail), showing stable performance where standard LLMs degrade on tail entities.
Iterative filtering at the schema level effectively manages the 'knowledge overloading' problem inherent in retrieving full neighborhoods.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graphs (structure, entities, relations)
Retrieval-Augmented Generation (RAG)
Semantic Parsing (SP) for KGQA
Chain-of-Thought (CoT) prompting

Key Terms

KGQA: Knowledge Graph Question Answering—systems that answer natural language questions by querying a structured Knowledge Graph

Semantic Parsing: The process of converting a natural language question into a structured logical query (like SPARQL) to execute against a database

SPARQL: A standard query language for graph databases (RDF), used to retrieve specific triples

Entity Neighborhood: The set of all direct relations and connected entities/attributes surrounding a specific node (entity) in the graph

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer

Truthfulness: A metric defined in the paper as Accuracy minus Hallucination Rate (T = A - H), so a hallucinated answer counts against the score while a missed (unanswered) question neither adds nor subtracts

Head/Torso/Tail entities: Categorization of entities based on their popularity/connectivity in the graph (Head = most popular, Tail = least)

SFT: Supervised Fine-Tuning—updating a pre-trained model on a specific dataset to improve performance on a task