Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching

📝 Paper Summary

Graph-based RAG pipeline, Schema Matching

KG-RAG4SM improves schema matching by retrieving and ranking relevant subgraphs from large external knowledge graphs to augment LLM prompts with semantic context.

Core Problem

Traditional and LLM-based schema matching methods fail to resolve semantic ambiguities (e.g., acronyms, hidden duplicates) in complex scenarios due to a lack of domain knowledge and common sense.

Why it matters:

Integrating large-scale heterogeneous databases (e.g., Electronic Health Records) is critical for modern data management but is hindered by semantic heterogeneity.
Existing similarity-based methods identify only equivalence relationships and ignore taxonomic ones, while LLMs hallucinate without external grounding.
Detecting duplicate attributes across disparate schemas requires domain expertise often missing from standard training data.

Concrete Example: In a healthcare scenario, the attribute 'ATTEND_DOCTOR' in one table is a duplicate of 'DOCTOR' in another, and both map to 'provider_id'. Standard methods fail to match 'ATTEND_DOCTOR' to 'provider_id' because they lack the specific domain knowledge connecting these terms.

Key Novelty

Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching (KG-RAG4SM)

Augments LLM prompts with subgraphs retrieved from external large-scale Knowledge Graphs (KGs) like Wikidata to provide missing semantic context.
Introduces a hybrid retrieval strategy combining vector similarity search for entities/relations with BFS (Breadth-First Search) traversal to find relevant connections.
Employs a ranking scheme to prune irrelevant graph components, preventing context poisoning and keeping the prompt concise.

Architecture

Figure: The overall architecture of KG-RAG4SM, illustrating the flow from the input question to final answer generation via KG retrieval.

Evaluation Highlights

Outperforms the LLM-based SOTA (Jellyfish-8B) by 35.89% in precision and 30.50% in F1 score (relative improvement) on the MIMIC dataset.
With GPT-4o-mini, outperforms the PLM-based SOTA (SMAT) by 69.20% in precision and 21.97% in F1 score (relative improvement) on the Synthea dataset.
Demonstrates scalability and efficiency in end-to-end matching tasks without requiring LLM re-training.

Breakthrough Assessment

7/10

The paper reports significant performance gains in domain-specific schema matching by applying Graph RAG. While RAG itself is well established, applying it to schema matching with dedicated traversal and pruning strategies is a strong contribution.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of attribute pairs from source and target schemas to determine semantic correspondence.

Inputs: Source attribute c_s, target attribute c_t, textual descriptions, and external Knowledge Graph KG.

Outputs: Binary decision (match/no-match) for the attribute pair (c_s, c_t).

Pipeline Flow

Knowledge Retrieval: Entity/Relation Retrieval → Subgraph Construction → Ranking/Pruning
Generation: Prompt Construction → LLM Inference
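The two stages above can be read as a plain function composition. The following is a minimal Python sketch, not the authors' code: the name run_pipeline, the five callables, and the yes/no answer convention are assumptions introduced for illustration, and the stubs stand in for the real retriever, graph traversal, ranker, and LLM.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Path = List[Triple]            # a chain of triples from the KG


def run_pipeline(
    question: str,
    retrieve_entities: Callable[[str], List[str]],
    build_subgraph: Callable[[List[str]], List[Path]],
    rank_and_prune: Callable[[str, List[Path]], List[Path]],
    build_prompt: Callable[[str, List[Path]], str],
    call_llm: Callable[[str], str],
) -> bool:
    """Hypothetical orchestration of the stages listed in the pipeline flow."""
    seeds = retrieve_entities(question)        # Entity/Relation Retrieval
    paths = build_subgraph(seeds)              # Subgraph Construction
    context = rank_and_prune(question, paths)  # Ranking/Pruning
    prompt = build_prompt(question, context)   # Prompt Construction
    answer = call_llm(prompt)                  # LLM Inference
    # Assumed convention: the model answers "yes" or "no".
    return answer.strip().lower().startswith("yes")


# Stub stages (all hypothetical) that record the order in which they run.
calls: List[str] = []


def stub_retrieve(question: str) -> List[str]:
    calls.append("retrieve")
    return ["physician"]


def stub_subgraph(seeds: List[str]) -> List[Path]:
    calls.append("subgraph")
    return [[(seeds[0], "subclass of", "health care provider")]]


def stub_rank(question: str, paths: List[Path]) -> List[Path]:
    calls.append("rank")
    return paths[:1]


def stub_prompt(question: str, paths: List[Path]) -> str:
    calls.append("prompt")
    return f"{question}\nContext: {paths}"


def stub_llm(prompt: str) -> str:
    calls.append("llm")
    return "Yes"


decision = run_pipeline(
    "Does ATTEND_DOCTOR match provider_id?",
    stub_retrieve, stub_subgraph, stub_rank, stub_prompt, stub_llm,
)
print(decision, calls)
```

Passing the stages in as callables keeps the sketch independent of any particular vector store or LLM client; each stub can be swapped for a real component without changing the control flow.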

System Modules

Entity/Relation Retriever (Knowledge Retrieval)

Identify relevant entities and relations in the KG based on the input question

Model or implementation: RoBERTa (for embeddings) + ChromaDB (Vector Store)
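The retrieval step amounts to a nearest-neighbour search over embedded KG labels. The sketch below illustrates the idea only: a toy bag-of-words vector stands in for RoBERTa embeddings, a brute-force cosine scan stands in for the ChromaDB vector store, and the label list is invented for illustration rather than taken from Wikidata.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words vector (assumption: stand-in for a RoBERTa encoder)."""
    tokens = text.lower().replace("_", " ").split()
    return {tok: float(n) for tok, n in Counter(tokens).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, labels: List[str], k: int = 2) -> List[Tuple[str, float]]:
    """Brute-force search (assumption: stand-in for a vector-store query)."""
    q = embed(query)
    scored = [(label, cosine(q, embed(label))) for label in labels]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical KG entity labels.
kg_labels = [
    "attending physician",
    "health care provider",
    "hospital ward",
    "physician",
]

hits = top_k("ATTEND_DOCTOR attending physician of the patient", kg_labels, k=2)
for label, score in hits:
    print(f"{label}: {score:.3f}")
```

A real deployment would replace the linear scan with an approximate index, since scanning every label does not scale to a KG the size of Wikidata.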

Subgraph Constructor (Knowledge Retrieval)

Traverse the KG starting from retrieved entities to find connecting paths

Model or implementation: BFS (Breadth-First Search) algorithm
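As an illustration of the traversal step, the sketch below runs a breadth-first search over a small set of triples and returns the shortest chain of triples linking a retrieved entity to a goal entity. The triples, the function name bfs_shortest_path, the max_hops cutoff, and the choice to follow edges only from subject to object are assumptions made for this example; the summary does not specify these details.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def bfs_shortest_path(
    triples: List[Triple], start: str, goal: str, max_hops: int = 3
) -> Optional[List[Triple]]:
    """Return the shortest path of triples from start to goal, or None."""
    # Index outgoing edges by subject (directed traversal; an assumption).
    outgoing: Dict[str, List[Triple]] = {}
    for triple in triples:
        outgoing.setdefault(triple[0], []).append(triple)

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop limit
        for triple in outgoing.get(node, []):
            neighbour = triple[2]
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [triple]))
    return None


# Toy triples invented for illustration (not taken from Wikidata).
toy_kg: List[Triple] = [
    ("attending physician", "subclass of", "physician"),
    ("physician", "subclass of", "health care provider"),
    ("physician", "field of work", "medicine"),
    ("hospital ward", "part of", "hospital"),
]

path = bfs_shortest_path(toy_kg, "attending physician", "health care provider")
print(path)
```

Because BFS expands nodes layer by layer, the first time the goal is dequeued the path to it has the minimum number of hops, which is why no extra bookkeeping is needed to find the shortest connection.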

Subgraph Pruner/Ranker (Knowledge Retrieval)

Filter and rank paths to select the most semantically relevant context

Model or implementation: Frequency-based ranking + Path length normalization
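The summary does not give the exact scoring formula, so the following is only one plausible reading of "frequency-based ranking with path length normalization": count how often query terms occur in a path, divide by the number of triples in the path, and keep the top k. The function names, the tokenization, and the toy paths are all assumptions.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
Path = List[Triple]


def score_path(path: Path, query_terms: Set[str]) -> float:
    """Query-term frequency in the path, normalized by path length."""
    if not path:
        return 0.0
    tokens = [
        tok
        for triple in path
        for part in triple
        for tok in part.lower().split()
    ]
    frequency = sum(1 for tok in tokens if tok in query_terms)
    return frequency / len(path)


def rank_and_prune(paths: List[Path], query: str, k: int = 2) -> List[Path]:
    """Keep the k highest-scoring paths; the rest are pruned."""
    terms = set(query.lower().replace("_", " ").split())
    ranked = sorted(paths, key=lambda p: score_path(p, terms), reverse=True)
    return ranked[:k]


# Toy candidate paths invented for illustration.
path_a: Path = [("attending physician", "subclass of", "physician")]
path_b: Path = [
    ("attending physician", "subclass of", "physician"),
    ("physician", "field of work", "medicine"),
]
path_c: Path = [("hospital ward", "part of", "hospital")]

query = "ATTEND_DOCTOR attending physician provider"
kept = rank_and_prune([path_c, path_b, path_a], query, k=2)
print(kept)
```

Dividing by path length keeps long paths from winning simply because they contain more tokens: path_b has more raw term hits than path_a (4 versus 3) but scores lower after normalization (2.0 versus 3.0).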

Schema Matcher

Generate the final matching decision using the retrieved context

Model or implementation: GPT-4o-mini (or GPT-3.5-turbo)
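To make the generation step concrete, the sketch below shows one way retrieved paths could be serialized into a prompt and how a yes/no reply could be mapped to the binary decision. The prompt wording, the triple serialization, and the function names are hypothetical; the paper's actual prompt template is not reproduced in this summary, and no LLM is called here.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
Path = List[Triple]


def verbalize(path: Path) -> str:
    """Serialize a path as a sequence of (subject, predicate, object) triples."""
    return " ; ".join(f"({s}, {p}, {o})" for s, p, o in path)


def build_prompt(source: str, target: str, paths: List[Path]) -> str:
    """Assemble a hypothetical matching prompt with KG context."""
    context = [f"- {verbalize(p)}" for p in paths]
    if not context:
        context = ["- (no context retrieved)"]
    lines = [
        "Decide whether the two schema attributes refer to the same concept.",
        f"Source attribute: {source}",
        f"Target attribute: {target}",
        "Knowledge graph context:",
        *context,
        "Answer with 'yes' or 'no'.",
    ]
    return "\n".join(lines)


def parse_decision(answer: str) -> bool:
    """Map the model's free-text reply to match (True) or no-match (False)."""
    return answer.strip().lower().startswith("yes")


# Toy context invented for illustration.
paths: List[Path] = [[("physician", "subclass of", "health care provider")]]
prompt = build_prompt("ATTEND_DOCTOR", "provider_id", paths)
print(prompt)
```

In use, the prompt string would be sent to the chosen chat model and the reply passed to parse_decision; a stricter parser may be needed if the model adds explanations before its answer.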

Novel Architectural Elements

Hybrid retrieval pipeline integrating vector-based entity search with BFS traversal specifically for schema matching
Specific ranking scheme normalizing frequency-based scoring with path length to prune KG subgraphs

Modeling

Base Model: GPT-4o-mini (primary), GPT-3.5-turbo (comparison)

Training Method: Inference-only RAG (no training of the LLM)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Jellyfish-8B: KG-RAG4SM uses external knowledge retrieval without fine-tuning, whereas Jellyfish relies on internalized knowledge from fine-tuning.
vs. SMAT: KG-RAG4SM incorporates explicit semantic paths from KGs, whereas SMAT relies solely on textual features of schema elements.
vs. GraphRAG [not cited in paper]: Standard GraphRAG builds graphs from local text chunks; KG-RAG4SM retrieves from pre-existing massive external KGs (Wikidata).

Limitations

Dependency on the quality and coverage of the external Knowledge Graph (e.g., domain-specific terms might be missing in Wikidata).
Retrieval latency can be high when traversing large-scale graphs like Wikidata.
Performance depends on the embedding model's ability to map schema terms to KG entities correctly.

📊 Experiments & Results

Evaluation Setup

Binary classification of schema attribute pairs (match vs. no-match).

Benchmarks:

MIMIC-III (Healthcare Schema Matching)
Synthea (Synthetic Healthcare Data Schema Matching)
EMED (Real-world Healthcare Schema Matching, e-MedSolution to OMOP CDM) [New]

Metrics:

Precision
Recall
F1 score
Statistical methodology: Not explicitly reported in the paper
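For reference, the three metrics can be computed from binary match labels as follows. This is a generic sketch of the standard definitions, with made-up labels; it is not the paper's evaluation script.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (match) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Made-up labels: 1 = match, 0 = no-match.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```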

Key Results

KG-RAG4SM significantly outperforms baseline LLM and PLM methods on healthcare datasets.
Benchmark	Metric	Baseline	This Paper	Δ
MIMIC	Precision	0.45	0.6115	+0.1615
MIMIC	F1	0.462	0.6029	+0.1409
Synthea	Precision	0.50	0.846	+0.346
Synthea	F1	0.65	0.7928	+0.1428
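The Δ column reports absolute differences, while the percentages quoted under Evaluation Highlights correspond to the same gains expressed relative to the baseline. The short check below recomputes both from the table values; the variable and function names are illustrative only.

```python
from typing import List, Tuple

# (benchmark, metric, baseline, this_paper) copied from the table above.
rows: List[Tuple[str, str, float, float]] = [
    ("MIMIC", "Precision", 0.45, 0.6115),
    ("MIMIC", "F1", 0.462, 0.6029),
    ("Synthea", "Precision", 0.50, 0.846),
    ("Synthea", "F1", 0.65, 0.7928),
]


def absolute_gain(baseline: float, ours: float) -> float:
    """Difference in metric points (the table's delta column)."""
    return ours - baseline


def relative_gain_pct(baseline: float, ours: float) -> float:
    """Gain as a percentage of the baseline value."""
    return (ours - baseline) / baseline * 100.0


relative = [round(relative_gain_pct(b, o), 2) for _, _, b, o in rows]

for (bench, metric, b, o), rel in zip(rows, relative):
    print(f"{bench} {metric}: +{absolute_gain(b, o):.4f} absolute, +{rel:.2f}% relative")
```

For example, MIMIC precision rises from 0.45 to 0.6115, an absolute gain of 0.1615, which is 0.1615 / 0.45 ≈ 35.89% of the baseline.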

Experiment Figures

A motivating example showing how KG context helps match 'ATTEND_DOCTOR' to 'provider_id'.

Main Takeaways

Retrieving subgraphs from external KGs effectively mitigates LLM hallucinations in schema matching by providing grounded semantic context.
Vector-based retrieval combined with BFS traversal is more effective than direct LLM-based query generation (e.g., generating SPARQL/Cypher) for large KGs.
The approach scales well to large KGs (like Wikidata) and does not require expensive re-training or fine-tuning of the base LLM.

📚 Prerequisite Knowledge

Prerequisites

Schema Matching basics (aligning database schemas)
Knowledge Graphs (entities, relations, triples)
Retrieval-Augmented Generation (RAG) concepts
Vector embeddings and similarity search

Key Terms

Schema Matching: The process of identifying semantic correspondences between elements of two different database schemas.

Graph RAG: Graph Retrieval-Augmented Generation—using structured data from knowledge graphs to enhance LLM prompts.

BFS: Breadth-First Search—a graph traversal algorithm that explores neighbor nodes layer by layer.

HNSW: Hierarchical Navigable Small World—an algorithm for approximate nearest neighbor search in high-dimensional spaces.

Triple: The fundamental unit of a knowledge graph, consisting of (Subject, Predicate, Object).

Context Poisoning: Performance degradation in LLMs caused by including too much irrelevant information in the prompt.

PLM: Pre-trained Language Model (e.g., BERT, RoBERTa).

SOTA: State-of-the-Art—the current best performing methods.