CommunityKG-RAG: Leveraging Community Structures in Knowledge Graphs for Advanced Retrieval-Augmented Generation in Fact-Checking

📝 Paper Summary

Graph-based RAG pipeline
Modularized RAG pipeline

CommunityKG-RAG enhances zero-shot fact-checking by constructing a knowledge graph from articles, detecting community structures to identify relevant subgraphs, and converting these communities into textual context for LLMs.

Core Problem

Existing RAG systems often struggle with multi-hop reasoning and context integration because they retrieve fragmented text chunks that lack structural relationships, while direct KG-based methods (feeding triples) confuse LLMs trained on sequential text.

Why it matters:

LLMs suffer from hallucinations and outdated training data, jeopardizing fact-checking accuracy.
Standard RAG struggles when crucial information is buried in long texts or when retrieved contexts contain noise/contradictions.
Directly feeding Knowledge Graph triples (subject, relation, object) to LLMs is suboptimal because models are not trained to leverage such structured formats effectively.

Concrete Example: When verifying a claim that requires multi-hop reasoning, a standard RAG system might retrieve disparate sentences that never explicitly link the entities involved. CommunityKG-RAG instead retrieves a 'community' of interconnected entities (e.g., a subgraph of related political figures and events) and converts this structural context into natural language, so the LLM receives the linking evidence in a single coherent context.

Key Novelty

Community-Centric Knowledge Graph Retrieval

Constructs a Knowledge Graph from fact-checking articles and uses the Louvain algorithm to detect 'communities' (clusters of densely connected entities).
Retrieves entire communities based on semantic similarity to the claim, rather than just individual sentences or triples.
Converts the retrieved graph communities back into natural language sentences before feeding them to the LLM, bridging the gap between structured knowledge and sequential language processing.

Architecture

Overview of the CommunityKG-RAG framework pipeline.

Evaluation Highlights

Outperforms the KAPING baseline by 3.45 percentage points in accuracy (63.02% vs. 59.57%) on the MOCHEG dataset using Llama-2-7b.
Achieves higher accuracy (63.02%) than the Semantic Retrieval (56.09%) and No Retrieval (51.13%) baselines.
Indicates that converting KG communities to sentences works better than feeding raw triples: the triple-based KAPING baseline trails by 3.45 points.

Breakthrough Assessment

7/10

Novel integration of community detection in KGs for RAG. Effectively addresses the structure-vs-text gap in LLMs. Strong zero-shot performance, though evaluated on a single dataset type.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot fact-checking using a corpus of articles to classify claims.

Inputs: A claim c and a corpus of fact-checking articles P.

Outputs: A truthfulness label y: supported, refuted, or NEI (not enough information).

Pipeline Flow

Preprocessing: Coreference Resolution → Relation Extraction → KG Construction
Indexing: Community Detection (Louvain) → Community Embedding
Inference: Claim Embedding → Community Retrieval → Sentence Selection → LLM Generation
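The three stages above split naturally into an offline part (preprocessing and indexing) and an online part (inference). The following is a minimal sketch of that split, not the paper's implementation: `CommunityIndex`, `build_index`, `check_claim`, and every stage callable passed in are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CommunityIndex:
    """Hypothetical container for what the indexing stage produces."""
    members: Dict[int, List[str]]    # community id -> entity names
    vectors: Dict[int, List[float]]  # community id -> community embedding


def build_index(articles, resolve_coref, extract_triples,
                detect_communities, embed_community):
    """Offline: preprocessing and indexing, run once per corpus."""
    triples = []
    for article in articles:
        # Coreference resolution runs before relation extraction.
        triples.extend(extract_triples(resolve_coref(article)))
    members = detect_communities(triples)
    vectors = {cid: embed_community(nodes) for cid, nodes in members.items()}
    return CommunityIndex(members, vectors)


def check_claim(claim, index, embed_claim, retrieve,
                select_sentences, generate):
    """Online: inference for a single claim."""
    claim_vec = embed_claim(claim)
    top_ids = retrieve(claim_vec, index.vectors)
    evidence = select_sentences(claim, [index.members[cid] for cid in top_ids])
    return generate(claim, evidence)
```

Passing the stages in as callables keeps the sketch independent of any particular coreference, extraction, or embedding model.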

System Modules

Coreference Resolver (Preprocessing & Construction)

Resolve entity ambiguities and cluster mentions referring to the same entity

Model or implementation: SpanBERT-based model (Lee et al. 2018)

Relation Extractor (Preprocessing & Construction)

Extract entities and relationships to build the graph

Model or implementation: REBEL (Cabot and Navigli 2021)
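As a rough illustration of how extracted triples might become a graph, the sketch below builds an undirected adjacency map and records which sentence each entity came from. The tuple layout `(subject, relation, object, sentence_id)` and the function `build_kg` are assumptions made for this example; they do not reproduce REBEL's actual output format.

```python
from collections import defaultdict


def build_kg(triples):
    """Build an undirected adjacency map plus an entity -> sentence-id index.

    `triples` is a list of (subject, relation, object, sentence_id) tuples,
    a hypothetical stand-in for relation-extractor output.
    """
    adjacency = defaultdict(set)
    provenance = defaultdict(set)
    for subj, _relation, obj, sent_id in triples:
        # Each triple contributes one undirected edge between its entities.
        adjacency[subj].add(obj)
        adjacency[obj].add(subj)
        # Remember the source sentence so it can be recovered later.
        provenance[subj].add(sent_id)
        provenance[obj].add(sent_id)
    return dict(adjacency), dict(provenance)


triples = [
    ("Alice", "member of", "Senate", 0),
    ("Alice", "sponsored", "Bill 12", 1),
    ("Bob", "opposed", "Bill 12", 1),
]
adjacency, provenance = build_kg(triples)
```

Keeping the sentence ids alongside the graph is one way to get from retrieved entities back to natural-language evidence.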

Community Detector

Partition the graph into densely connected communities

Model or implementation: Louvain algorithm
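To make the modularity-optimization idea concrete, here is a simplified sketch of Louvain's first phase only (greedy local moving) on an unweighted, undirected graph. It omits the graph-aggregation phase that the full algorithm repeats, so it is not a complete Louvain implementation; the function names and the toy graph are assumptions for illustration.

```python
def modularity(adj, community_of):
    """Modularity of a partition; assumes the graph has at least one edge."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    intra = 0          # same-community (u, v) pairs, each edge counted twice
    degree_sum = {}    # community -> total degree of its members
    for u, nbrs in adj.items():
        c = community_of[u]
        degree_sum[c] = degree_sum.get(c, 0) + len(nbrs)
        intra += sum(1 for v in nbrs if community_of[v] == c)
    return intra / two_m - sum((d / two_m) ** 2 for d in degree_sum.values())


def local_moving(adj):
    """Greedily move nodes to the neighbouring community with the best gain."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    community_of = {u: u for u in adj}             # start: one node each
    total_degree = {u: len(adj[u]) for u in adj}   # keyed by community label
    improved = True
    while improved:
        improved = False
        for u in sorted(adj):
            k_u = len(adj[u])
            current = community_of[u]
            total_degree[current] -= k_u           # take u out temporarily
            links = {}                             # community -> edges from u
            for v in adj[u]:
                c = community_of[v]
                links[c] = links.get(c, 0) + 1
            best = current
            best_gain = links.get(current, 0) - total_degree[current] * k_u / two_m
            for c, k_in in sorted(links.items()):
                gain = k_in - total_degree[c] * k_u / two_m
                if gain > best_gain + 1e-12:
                    best, best_gain = c, gain
            total_degree[best] += k_u              # put u into the winner
            if best != current:
                community_of[u] = best
                improved = True
    return community_of


# Toy graph: two triangles joined by the single edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
community_of = local_moving(adj)
```

On this toy graph the two triangles end up in separate communities, which is the kind of densely connected cluster the detector is meant to surface.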

Community Retriever (Retrieval & Selection)

Retrieve relevant communities based on semantic similarity to the claim

Model or implementation: Sentence-BERT (for claim and community embedding comparison)

Sentence Selector (Retrieval & Selection)

Select specific sentences from the top communities

Model or implementation: Sentence-BERT
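The selection step can be pictured as scoring each candidate sentence against the claim and keeping the best ones above a cut-off. In the sketch below a bag-of-words cosine stands in for Sentence-BERT so the example runs without any model; `select_sentences`, `top_n`, and `min_score` are hypothetical and are not the paper's thresholds.

```python
import math
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity over lowercase token counts (a stand-in encoder)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def select_sentences(claim, sentences, top_n=3, min_score=0.4):
    """Keep at most top_n sentences whose similarity is at least min_score."""
    scored = [(bow_cosine(claim, s), s) for s in sentences]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept[:top_n]]


claim = "the senator voted against the bill"
candidates = [
    "The senator voted for the bill in March",
    "Heavy rain flooded the coast",
    "The bill failed after the vote",
]
evidence = select_sentences(claim, candidates)
```

With a real sentence encoder the scores would reflect meaning rather than shared tokens, but the filter-then-rank shape stays the same.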

Generator

Classify the claim based on retrieved evidence

Model or implementation: Llama-2-7b
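A zero-shot generator needs a prompt that combines the selected evidence with the claim, and a way to map the model's free-text answer back to a label. The sketch below shows one possible shape; the prompt wording, `build_prompt`, `parse_label`, and the fallback to NEI are all assumptions, since the summary does not give the paper's prompt.

```python
LABELS = ("supported", "refuted", "NEI")


def build_prompt(claim, evidence):
    """Assemble a zero-shot classification prompt (hypothetical wording)."""
    lines = ["Evidence:"]
    lines += [f"{i}. {sentence}" for i, sentence in enumerate(evidence, start=1)]
    lines += [
        "",
        f"Claim: {claim}",
        "Answer with one label: " + ", ".join(LABELS) + ".",
        "Label:",
    ]
    return "\n".join(lines)


def parse_label(completion):
    """Map a model completion to a label; unknown answers fall back to NEI."""
    text = completion.strip().lower()
    for label in LABELS:
        if text.startswith(label.lower()):
            return label
    return "NEI"


prompt = build_prompt(
    "The senator voted against the bill",
    ["The senator voted for the bill in March"],
)
```

The string returned by `build_prompt` would be sent to the LLM, and `parse_label` applied to whatever text comes back.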

Novel Architectural Elements

Two-stage retrieval hierarchy: First retrieves structural communities (subgraphs), then filters for sentences within those communities.
Utilization of community embeddings (averaging node embeddings) as the retrieval index unit instead of individual documents or raw triples.
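The community-level index described above can be sketched as follows: each community's embedding is the mean of its node embeddings, and communities are ranked by cosine similarity to the claim embedding. The two-dimensional vectors, the community names, and `retrieve_communities` are toy assumptions; a real system would use sentence-encoder outputs.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_communities(claim_vec, communities, node_vecs, top_k=1):
    """Rank communities by similarity between the claim and their mean vector."""
    scored = []
    for cid, nodes in communities.items():
        community_vec = mean_vector([node_vecs[n] for n in nodes])
        scored.append((cosine(claim_vec, community_vec), cid))
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:top_k]]


node_vecs = {
    "Alice": [1.0, 0.0], "Senate": [0.8, 0.2],
    "Storm": [0.0, 1.0], "Coast": [0.1, 0.9],
}
communities = {"politics": ["Alice", "Senate"], "weather": ["Storm", "Coast"]}
top = retrieve_communities([0.9, 0.1], communities, node_vecs)
```

Because retrieval operates on whole communities, the second stage only has to filter sentences inside the few communities that survive this ranking.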

Modeling

Base Model: Llama-2-7b

Training Method: Zero-shot inference (no training of the LLM)

Key Hyperparameters:

delta (filtering threshold): Not reported in the paper
lambda (filtering threshold): Not reported in the paper

Compute: KG construction and community detection are performed once per corpus; inference is zero-shot and requires no LLM training.

Comparison to Prior Work

vs. KAPING: Retrieves communities (subgraphs) and converts them to sentences rather than feeding raw triples; leverages community structure.
vs. Semantic Retrieval: Uses graph structure (communities) to define search space boundaries, ensuring contextually related entities are retrieved together.
vs. GraphRAG (Microsoft) [not cited in paper]: Both use community detection, but CommunityKG-RAG targets zero-shot fact-checking and converts retrieved communities back into sentences for standard LLMs.

Limitations

Dependency on the quality of the underlying Knowledge Graph construction (REBEL) and coreference resolution.
Performance depends on the Louvain algorithm's ability to create meaningful communities.
Computational cost of initial graph construction and community detection might be high for very large corpora.
Evaluated on a narrow set of benchmarks; MOCHEG is the only one with reported results here.

Reproducibility

No code is provided. The paper uses a public dataset (MOCHEG) and open-source models (Llama-2, REBEL, SpanBERT). Values for the filtering thresholds delta and lambda are not explicitly listed in the main text.

📊 Experiments & Results

Evaluation Setup

Zero-shot fact-checking on the MOCHEG dataset.

Benchmarks:

MOCHEG: fact-checking, three-way classification (Supported, Refuted, NEI)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	Baseline (%)	This Paper (%)	Δ (points)
MOCHEG	Accuracy	No Retrieval	51.13	63.02	+11.89
MOCHEG	Accuracy	Semantic Retrieval	56.09	63.02	+6.93
MOCHEG	Accuracy	KAPING	59.57	63.02	+3.45
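The Δ values are differences in accuracy, so they are percentage points rather than relative percent. The short check below recomputes them, pairing each row with the baselines named in Evaluation Highlights, and also derives the relative gain for comparison; the variable names are illustrative only.

```python
ours = 63.02  # CommunityKG-RAG accuracy on MOCHEG (%)
baselines = {
    "No Retrieval": 51.13,
    "Semantic Retrieval": 56.09,
    "KAPING": 59.57,
}

# Absolute gain in percentage points (the Δ column).
delta_points = {name: round(ours - acc, 2) for name, acc in baselines.items()}

# Relative gain in percent of the baseline's accuracy.
relative_gain = {
    name: round(100 * (ours - acc) / acc, 1) for name, acc in baselines.items()
}

for name in baselines:
    print(f"{name}: +{delta_points[name]} points, +{relative_gain[name]}% relative")
```

Keeping the two quantities apart avoids reading "+3.45" as a 3.45% relative improvement.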

Main Takeaways

Integrating Knowledge Graph community structures significantly improves retrieval relevance for fact-checking compared to standard semantic retrieval.
Converting retrieved graph knowledge back into natural language sentences is superior to feeding raw triples to the LLM (as done in KAPING).
The zero-shot framework is effective without requiring additional fine-tuning of the LLM.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graphs (structure, triples)
Retrieval-Augmented Generation (RAG)
Community Detection algorithms (Louvain)
BERT embeddings

Key Terms

Knowledge Graph (KG): A structured representation of data using entities (nodes) and their relationships (edges).

Community Detection: Algorithms used to identify clusters of nodes in a graph that are more densely connected to each other than to the rest of the network.

Louvain algorithm: A heuristic method for extracting communities from large networks based on modularity optimization.

Modularity: A measure of how strongly a network divides into modules (clusters), comparing the density of edges inside communities with the density expected if edges were placed at random.
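For reference, the standard definition of modularity for an undirected graph is given below. The symbols are the usual ones from the community-detection literature and do not come from the paper summary: A_ij is the adjacency matrix, k_i the degree of node i, m the number of edges, and c_i the community of node i. The Kronecker delta in the formula is unrelated to the hyperparameter named delta.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),
\qquad
\delta(c_i, c_j) =
\begin{cases}
1 & \text{if } c_i = c_j, \\
0 & \text{otherwise.}
\end{cases}
```

Louvain searches for a partition that makes Q large: higher values mean more edges fall inside communities than a degree-preserving random graph would predict.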

Triple: The fundamental unit of a Knowledge Graph, consisting of (subject, predicate, object).

Zero-shot: The setting where the model performs the task without any specific training examples for that task.

Multi-hop reasoning: The ability to connect pieces of information from different sources or steps to arrive at a conclusion.

Coreference resolution: The task of finding all expressions that refer to the same entity in a text.