KGARevion: An AI Agent for Knowledge-Intensive Biomedical QA

📝 Paper Summary

Agentic RAG pipeline
Graph-based RAG pipeline

KGARevion is a biomedical agent that prompts an LLM to generate knowledge triplets, then verifies and corrects them using a fine-tuned model grounded in structural knowledge graph embeddings before answering.

Core Problem

General-purpose LLMs lack specialized biomedical knowledge and often hallucinate, while standard RAG methods relying on direct KG retrieval miss implicit relationships and lack mechanisms to verify retrieved information.

Why it matters:

Incorrect medical advice from AI can have serious safety implications in clinical settings
Standard retrieval fails to capture biological similarities between proteins that lack direct edges in a Knowledge Graph
Current systems struggle to integrate codified scientific knowledge (databases) with tacit clinical intuition (LLM reasoning)

Concrete Example: When asked which genes interact with the Heat Shock Protein 70 family in the context of Retinitis Pigmentosa 59, an LLM might hallucinate a connection or fail to distinguish HSPA4 from HSPA8 because their names and descriptions overlap semantically. KGARevion instead verifies each proposed link against the KG structure.

Key Novelty

Agentic Generate-Verify-Revise Loop with Structural KG Embeddings

Instead of only retrieving from a graph, the agent first prompts the LLM to generate candidate triplets from its own knowledge, treating them as hypotheses rather than facts
It then strictly verifies these generated triplets using a separate LLM fine-tuned with TransE structural embeddings from the Knowledge Graph to catch errors
If errors are found, a 'Revise' action corrects the triplets before the final answer is generated, ensuring grounding

Architecture

The KGARevion framework illustrating the four-step process: Generate, Review, Revise, and Answer.

Evaluation Highlights

Achieves a 6.75% accuracy improvement over 15 baseline models across seven medical QA datasets
Improves accuracy by 10.4% on three newly curated medical QA datasets designed with varying levels of semantic complexity
Demonstrates strong zero-shot generalization on AfriMed-QA (African healthcare dataset), effectively handling underrepresented medical contexts

Breakthrough Assessment

8/10

Strong methodological contribution by combining generative flexibility with strict structural verification. Significant empirical gains on specialized benchmarks and good robustness analysis.

⚙️ Technical Details

Problem Definition

Setting: Biomedical Question Answering (Multiple-choice and Open-ended)

Inputs: A biomedical question q and a set of candidate answers C

Outputs: The correct answer identified from C

Pipeline Flow

Generate Action: LLM generates candidate knowledge triplets from the question
Review Action: Fine-tuned LLM verifies triplets using structural KG embeddings
Revise Action: Agent corrects rejected triplets if necessary
Answer Action: LLM generates final response using verified triplets
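
The four actions above can be read as a small control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `run_pipeline`, the four callables, and `max_rounds` are hypothetical names, and the toy knowledge graph uses placeholder entities rather than real biomedical facts.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def run_pipeline(
    question: str,
    generate: Callable[[str], List[Triplet]],
    review: Callable[[Triplet], bool],
    revise: Callable[[str, Triplet], Triplet],
    answer: Callable[[str, List[Triplet]], str],
    max_rounds: int = 1,  # assumption: the summary does not report this value
) -> str:
    """Generate triplets, keep the ones that pass review, revise the rest."""
    pending = generate(question)
    verified: List[Triplet] = []
    for round_idx in range(max_rounds + 1):
        rejected: List[Triplet] = []
        for triplet in pending:
            (verified if review(triplet) else rejected).append(triplet)
        # Stop when nothing was rejected or the revision budget is used up.
        if not rejected or round_idx == max_rounds:
            break
        pending = [revise(question, triplet) for triplet in rejected]
    # Only triplets that passed review reach the Answer action.
    return answer(question, verified)


# Toy stand-ins for the LLM calls (placeholder entities, not real biology).
TOY_KG = {("A", "rel", "B"), ("C", "rel", "D")}


def toy_generate(question: str) -> List[Triplet]:
    return [("A", "rel", "B"), ("C", "rel", "X")]


def toy_review(triplet: Triplet) -> bool:
    return triplet in TOY_KG


def toy_revise(question: str, triplet: Triplet) -> Triplet:
    head, relation, _ = triplet
    return (head, relation, "D")


def toy_answer(question: str, triplets: List[Triplet]) -> str:
    return "; ".join(" ".join(t) for t in triplets)
```

With these stubs, `run_pipeline("q", toy_generate, toy_review, toy_revise, toy_answer)` verifies the first triplet directly and the second only after one revision. Setting `max_rounds=0` drops the rejected triplet instead of revising it.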

System Modules

Generate Action

Prompt LLM to extract medical concepts and hypothesize relevant triplets

Model or implementation: LLM (e.g., GPT-4 or similar, depending on experiment)

Review Action

Verify correctness of generated triplets using KG structural knowledge

Model or implementation: LLM fine-tuned with LoRA on KG completion task

Revise Action

Modify false triplets to find valid connections in the KG

Model or implementation: Agent Logic / LLM Prompting

Answer Action

Synthesize final answer based on verified 'True' triplets

Model or implementation: LLM (General Purpose)

Novel Architectural Elements

Integration of pre-trained structural graph embeddings (TransE) directly into the LLM's prompt via an alignment projector (Attention + FFN) for verification
Generate-then-Verify-then-Revise workflow that specifically treats LLM-generated triplets as hypotheses to be structurally validated
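
To make the "Attention + FFN" projector concrete, here is a rough pure-Python sketch of how three structural embeddings (head, relation, tail) could be mixed by attention and then mapped to an LLM's embedding width. It is an illustration under assumptions, not the paper's architecture: the attention here has no learned query/key/value weights, the ReLU non-linearity and all dimensions are arbitrary choices, and `project_triplet` is a hypothetical name. Only the weight names `w1` and `w2` echo the summary.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention with no learned weights (a simplification)."""
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        mixed.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        )
    return mixed


def project_triplet(tokens: List[Vector], w1: Matrix, w2: Matrix) -> List[Vector]:
    """Attention over (head, relation, tail), then a two-layer FFN per position."""
    projected = []
    for vec in self_attention(tokens):
        hidden = [max(0.0, x) for x in matvec(w1, vec)]  # ReLU is an assumption
        projected.append(matvec(w2, hidden))
    return projected


# Toy sizes: 4-d KG embeddings, 8-d hidden layer, 6-d "LLM" embedding space.
rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
w2 = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(6)]
triplet_embeddings = [
    [0.1, 0.2, 0.3, 0.4],  # head
    [0.0, 1.0, 0.0, 1.0],  # relation
    [0.5, 0.5, 0.5, 0.5],  # tail
]
soft_tokens = project_triplet(triplet_embeddings, w1, w2)
```

The output is one vector per triplet element in the LLM's embedding width, which is what would let the verifier's prompt carry graph structure alongside the triplet text.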

Modeling

Base Model: Evaluated with various LLMs (e.g., GPT-4, Llama-3-70B, Meditron-70B)

Training Method: LoRA Fine-tuning on KG Completion Task

Objective Functions:

Purpose: Train the alignment projector and LoRA weights to predict triplet correctness.

Formally: Next-token prediction loss on the output 'True' or 'False'
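
One way to write this objective, as a sketch under assumed notation (none of these symbols come from the paper): let D be the set of labeled triplets, θ the trainable parameters, and e_h, e_r, e_t the structural embeddings supplied alongside the textual prompt.

```latex
\mathcal{L}(\theta) = - \sum_{((h, r, t),\, y) \in \mathcal{D}}
  \log p_{\theta}\!\left( y \mid \mathrm{prompt}(h, r, t),\ \mathbf{e}_h, \mathbf{e}_r, \mathbf{e}_t \right),
\qquad y \in \{\text{True}, \text{False}\}
```

If the tokenizer splits the label into several tokens, the log-probability is the sum of the per-token log-probabilities, as in ordinary next-token training.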

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: Alignment projector weights (W1, W2) and LoRA adapter weights
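
For reference, the standard LoRA parameterization keeps the pretrained weight W_0 frozen and learns only a low-rank update. The rank r, scaling factor α, and the choice of adapted layers are not reported in this summary, so the symbols below are generic.

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k}\ \text{(frozen)},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Only A and B (together with the projector weights W1 and W2 listed above) receive gradients, which is why the trainable parameter count stays small relative to the base model.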

Training Data:

Triplets extracted from UMLS-based Knowledge Graphs
Fine-tuning task: Given (h, r, t) + embeddings, predict True/False
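
The summary does not say how 'False' examples are obtained. A common convention in KG completion, assumed here purely for illustration, is to corrupt the tail of a true triplet with another entity and keep the result only if it is not itself in the graph. `build_examples` is a hypothetical helper, and the entities are placeholders.

```python
import random
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def build_examples(kg: Set[Triplet], seed: int = 0) -> List[Tuple[Triplet, str]]:
    """Pair each true triplet with one tail-corrupted negative (assumed scheme)."""
    rng = random.Random(seed)
    entities = sorted({entity for h, _, t in kg for entity in (h, t)})
    examples: List[Tuple[Triplet, str]] = []
    for h, r, t in sorted(kg):
        examples.append(((h, r, t), "True"))
        # Candidate tails must differ from the true tail and must not
        # accidentally form another true triplet.
        candidates = [e for e in entities if e != t and (h, r, e) not in kg]
        if candidates:
            examples.append(((h, r, rng.choice(candidates)), "False"))
    return examples


toy_kg = {("a", "r", "b"), ("c", "r", "d")}
toy_examples = build_examples(toy_kg)
```

Each resulting pair would then be rendered into a prompt, combined with the triplet's structural embeddings, and trained against its 'True' or 'False' label.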

Key Hyperparameters:

embedding_method: TransE
max_revision_rounds: Not explicitly reported in the paper (implied k >= 1)
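
TransE represents a relation as a translation in embedding space, so a plausible triplet satisfies h + r ≈ t. The sketch below scores a triplet by the negative L2 distance between h + r and t; the vectors are made-up toy values, and the embedding dimension and norm used in the paper are not given in this summary.

```python
import math
from typing import Sequence


def transe_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """Negative L2 distance between (h + r) and t; higher means more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Toy 2-d embeddings: the first tail sits exactly at h + r, the second does not.
head = [1.0, 0.0]
relation = [0.0, 1.0]
good_tail = [1.0, 1.0]
bad_tail = [0.0, 0.0]

good_score = transe_score(head, relation, good_tail)
bad_score = transe_score(head, relation, bad_tail)
```

Here `good_score` is zero (a perfect translation) and `bad_score` is negative, which is the kind of signal that lets structurally similar entities be compared even when no direct edge connects them.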

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. GenGround: KGARevion uses structural embeddings (TransE) for verification, whereas GenGround relies on semantic dependencies
vs. GraphRAG: KGARevion uses a generative approach to hypothesize triplets first, rather than just retrieving existing paths
vs. Self-RAG: Focuses specifically on structured biomedical triplets and KG alignment rather than general text retrieval
vs. MedGraphRAG: Adds a specific 'Revise' step to correct hallucinations dynamically

Limitations

Relies on the completeness of the underlying Knowledge Graph; under the 'Incomplete Knowledge' assumption, hallucinated triplets may pass unverified when their entities are not in the KG
Performance depends on the quality of the entity mapping to UMLS codes
The 'Revise' loop adds computational overhead compared to single-pass retrieval methods
Requires training specific structural embeddings (TransE) for the KG before inference can begin

Reproducibility

Code: https://github.com/mims-harvard/KGARevion

The paper uses standard benchmarks and introduces new ones (AfriMed-QA). Specific GPU hours or training costs are not detailed.

📊 Experiments & Results

Evaluation Setup

Medical Question Answering (Multiple Choice and Open Ended)

Benchmarks:

MedQA (Medical QA)
MedMCQA (Medical QA)
PubMedQA (Medical QA)
AfriMed-QA (African Healthcare QA) [New]
3 New Semantic Complexity Datasets (Medical QA with varying complexity) [New]

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
KGARevion demonstrates superior performance across standard medical QA benchmarks compared to baseline models.
MedQA	Accuracy	81.6	85.8	+4.2
MedMCQA	Accuracy	72.4	78.2	+5.8
Generalization to underrepresented domains is tested using the newly introduced AfriMed-QA dataset.
AfriMed-QA	Accuracy	78.5	84.3	+5.8
Aggregate performance analysis across multiple datasets and baselines.
7 datasets average	Accuracy	Not reported in the paper	Not reported in the paper	+6.75%

Experiment Figures

Comparison of reasoning approaches: Direct LLM answering vs. RAG vs. KGARevion.

Main Takeaways

KGARevion consistently outperforms both standard LLMs (like GPT-4) and specialized medical models (Med-PaLM 2) across diverse benchmarks.
The model shows robust zero-shot capabilities on the AfriMed-QA dataset, suggesting the KG grounding helps in underrepresented domains.
The approach is robust to input variations such as answer reordering, whereas standard LLMs show high variance under the same perturbation.
Grounding answers in generated triplets that are then structurally verified is more effective than direct retrieval (RAG), which has no verification step.
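
The reordering claim can be checked with a simple harness like the one below. This is an assumed protocol, not the paper's: `accuracy_under_reordering` and the `predict(question, options)` interface are hypothetical, and the two toy predictors only illustrate the difference between answering by content and answering by position.

```python
import random
import statistics
from typing import Callable, List, Tuple

Item = Tuple[str, List[str], int]  # (question, options, index of correct option)


def accuracy_under_reordering(
    items: List[Item],
    predict: Callable[[str, List[str]], int],
    n_shuffles: int = 5,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return mean and standard deviation of accuracy over shuffled option orders."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_shuffles):
        correct = 0
        for question, options, gold in items:
            order = list(range(len(options)))
            rng.shuffle(order)
            shuffled = [options[i] for i in order]
            picked = predict(question, shuffled)
            # Map the picked position back to the original option index.
            if order[picked] == gold:
                correct += 1
        accuracies.append(correct / len(items))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


def content_predictor(question: str, options: List[str]) -> int:
    """Toy predictor that finds the right option wherever it appears."""
    return options.index("right")


def position_predictor(question: str, options: List[str]) -> int:
    """Toy predictor that always picks the first option."""
    return 0


toy_items: List[Item] = [
    (f"q{i}", ["right", "wrong1", "wrong2", "wrong3"], 0) for i in range(20)
]
content_mean, content_std = accuracy_under_reordering(toy_items, content_predictor)
position_mean, position_std = accuracy_under_reordering(toy_items, position_predictor)
```

A predictor that reads content scores the same under every shuffle, while a position-biased one depends on where the correct option happens to land. A low standard deviation across shuffles is the behavior the takeaway attributes to KGARevion.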

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graph (KG) fundamentals (entities, relations, triplets)
Retrieval-Augmented Generation (RAG)
LoRA fine-tuning

Key Terms

KGARevion: Knowledge Graph-Based Agent for Revision—the proposed system that generates, reviews, and revises knowledge triplets

TransE: A method for learning low-dimensional embeddings of entities and relationships in a Knowledge Graph by modeling relationships as translations

Triplets: The fundamental unit of data in a Knowledge Graph, consisting of (Head Entity, Relation, Tail Entity)

Structural Embeddings: Vector representations of KG nodes that capture the graph topology, distinct from the semantic (text) embeddings used by standard LLMs

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to selected layers

UMLS: Unified Medical Language System—a compendium of many controlled vocabularies in the biomedical sciences

RAG: Retrieval-Augmented Generation, an approach in which a model first retrieves relevant documents or facts from an external source and then conditions its answer on them

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer

Zero-shot generalization: The ability of a model to perform a task or handle data (like African healthcare contexts) it was not explicitly trained on