KG-Rank: Enhancing Large Language Models for Medical QA with Knowledge Graphs and Ranking Techniques

📝 Paper Summary

Graph-based RAG pipeline

KG-Rank augments LLMs for medical QA by retrieving triples from a knowledge graph and refining them via multi-stage ranking to provide factual, non-redundant context for answer generation.

Core Problem

General LLMs lack medical training data and factual consistency, while standard retrieval methods often introduce irrelevant or redundant noise that compromises credibility.

Why it matters:

Medical advice requires high precision; inaccurate LLM outputs can lead to critical health risks
Merely retrieving external knowledge risks introducing irrelevant information that distracts the model
Prior works utilized external knowledge but overlooked how to optimally order and filter that knowledge to reduce noise and redundancy

Concrete Example: When asked about diet for a patient with acute renal and hepatic failure, a standard LLM suggests 1.6-2.2 g/kg protein (dangerous), whereas KG-Rank correctly identifies the need for restricted intake (0.8-1 g/kg) by retrieving and prioritizing specific medical contraindications.

Key Novelty

Knowledge Graph Retrieval with Multi-Stage Ranking

Combines structured knowledge graph retrieval (triplets) with three distinct ranking strategies (Similarity, Answer Expansion, MMR) to filter noise
Applies a specialized re-ranking step using a medical cross-encoder to rigorously select only the most factually relevant triples before generation
Integrates Maximal Marginal Relevance (MMR) to specifically reduce redundancy in retrieved medical facts, ensuring diverse yet concise context

Architecture

The complete workflow of the KG-Rank framework, from entity extraction to answer generation.

Evaluation Highlights

+18% improvement in ROUGE-L score on the ExpertQA-Bio dataset compared to zero-shot baselines
Outperforms standard RAG baselines on 4 medical QA datasets (LiveQA, ExpertQA-Med, ExpertQA-Bio, MedicationQA) in automated metrics
+14% improvement in ROUGE-L score when extended to open domains (Law, Business, Music, History) using DBpedia

Breakthrough Assessment

7/10

Strong empirical gains in the high-stakes medical domain using a logical pipeline of KG retrieval and ranking. While the components (MMR, cross-encoders) are known, their specific application to graph-based medical RAG is effective.

⚙️ Technical Details

Problem Definition

Setting: Long-form Question Answering in the medical domain using external knowledge

Inputs: Medical question Q

Outputs: Free-text long-form answer A

Pipeline Flow

Entity Extraction (MedNER)
Relation Retrieval (One-hop)
Ranking (Sim/AE/MMR)
Re-ranking (MedCPT)
Generation (LLM)
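
The five stages above can be read as a plain function composition. The sketch below is illustrative only: every function name and signature is hypothetical, each stage is passed in as a callable, and the stub stages use placeholder triples rather than real UMLS content or model calls.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def kg_rank_pipeline(
    question: str,
    extract_entities: Callable[[str], List[str]],
    retrieve_one_hop: Callable[[str], List[Triple]],
    rank: Callable[[str, List[Triple]], List[Triple]],
    rerank: Callable[[str, List[Triple]], List[Triple]],
    generate: Callable[[str, List[Triple]], str],
) -> str:
    """Hypothetical composition of the five stages listed above."""
    entities = extract_entities(question)
    # Collect one-hop triples for every extracted entity.
    candidates = [t for e in entities for t in retrieve_one_hop(e)]
    ranked = rank(question, candidates)
    selected = rerank(question, ranked)
    return generate(question, selected)


# Stub stages with placeholder data, only to show the data flow.
answer = kg_rank_pipeline(
    "placeholder question about entity_a",
    extract_entities=lambda q: ["entity_a"],
    retrieve_one_hop=lambda e: [(e, "rel_1", "entity_b"), (e, "rel_2", "entity_c")],
    rank=lambda q, ts: ts,          # keep the incoming order
    rerank=lambda q, ts: ts[:1],    # keep only the top candidate
    generate=lambda q, ts: f"{len(ts)} triple(s) used",
)
print(answer)
```

Passing stages as callables keeps the sketch independent of any particular NER prompt, embedding model, or LLM.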

System Modules

Entity Extractor

Identify medical entities in the question and map them to KG concepts

Model or implementation: LLM-based Prompt (MedNER)

Relation Retriever (Retrieval & Selection)

Fetch one-hop relations (triples) connected to the identified entities

Model or implementation: UMLS Database Query
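
As a rough illustration of one-hop retrieval, the sketch below uses a tiny in-memory triple list in place of a UMLS query. The index structure, the function names, and the choice to match an entity as either subject or object are assumptions made for the example, not details taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def build_index(triples: List[Triple]) -> Dict[str, List[Triple]]:
    """Index each triple under its subject and its object (assumed bidirectional lookup)."""
    index: Dict[str, List[Triple]] = defaultdict(list)
    for s, r, o in triples:
        index[s].append((s, r, o))
        if o != s:
            index[o].append((s, r, o))
    return index


def one_hop(index: Dict[str, List[Triple]], entity: str) -> List[Triple]:
    """Return only the triples that touch `entity` directly."""
    return list(index.get(entity, []))


# Placeholder graph: A -> B -> C -> D
toy_kg = [("A", "rel_1", "B"), ("B", "rel_2", "C"), ("C", "rel_3", "D")]
index = build_index(toy_kg)
neighbours_of_b = one_hop(index, "B")
print(neighbours_of_b)
```

The triple involving `D` is two hops from `B` and is therefore not returned.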

Ranker (Retrieval & Selection)

Initial ordering of triples to filter noise

Model or implementation: UmlsBERT (for embedding) + Algorithmic Ranking
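
A minimal sketch of similarity-based ranking follows, assuming the question and each triple have already been embedded as vectors. The two-dimensional vectors are made-up placeholders standing in for UmlsBERT embeddings, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_by_similarity(
    query_vec: Sequence[float],
    triple_vecs: Dict[Triple, Sequence[float]],
    top_k: int,
) -> List[Triple]:
    """Order triples by cosine similarity to the query and keep the top_k."""
    ordered = sorted(
        triple_vecs.items(),
        key=lambda kv: cosine(query_vec, kv[1]),
        reverse=True,
    )
    return [triple for triple, _ in ordered[:top_k]]


# Placeholder embeddings, not produced by any real model.
t1, t2, t3 = ("A", "rel_1", "B"), ("A", "rel_2", "C"), ("A", "rel_3", "D")
vectors = {t1: [0.9, 0.1], t2: [0.0, 1.0], t3: [0.5, 0.5]}
top_two = rank_by_similarity([1.0, 0.0], vectors, top_k=2)
print(top_two)
```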

Re-ranker (Retrieval & Selection)

Refine the top candidates using a more expensive, accurate model

Model or implementation: MedCPT (Cross-Encoder)
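
The re-ranking step can be sketched as scoring each (question, triple text) pair jointly and keeping the best candidates. In the example below the pair scorer is a simple token-overlap function used purely as a stand-in; it is not MedCPT, and all names here are hypothetical.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


def triple_to_text(triple: Triple) -> str:
    """Flatten a triple into a short string a pair scorer can read."""
    return " ".join(triple)


def rerank(
    question: str,
    triples: List[Triple],
    score_pair: Callable[[str, str], float],
    top_k: int,
) -> List[Triple]:
    """Score every (question, triple) pair and keep the top_k highest-scoring triples."""
    scored = [(score_pair(question, triple_to_text(t)), t) for t in triples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:top_k]]


def overlap_score(query: str, text: str) -> float:
    """Stand-in scorer (Jaccard token overlap); a real system would call a cross-encoder here."""
    q, d = set(query.lower().split()), set(text.lower().split())
    return len(q & d) / len(q | d) if (q | d) else 0.0


candidates = [
    ("delta", "links", "epsilon"),
    ("alpha", "links", "delta"),
    ("alpha", "beta", "gamma"),
]
best_two = rerank("alpha beta gamma", candidates, overlap_score, top_k=2)
print(best_two)
```

Because the scorer is injected, swapping the stand-in for a learned cross-encoder would not change the surrounding logic.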

Generator

Generate the final long-form answer using the question and selected triples

Model or implementation: GPT-4 or LLaMA2-13b

Novel Architectural Elements

Integration of Maximal Marginal Relevance (MMR) specifically for selecting diverse Knowledge Graph triplets in medical RAG
Pipeline that first ranks triples against the embedding of an LLM-drafted, possibly hallucinated, answer (Answer Expansion) and then applies Cross-Encoder re-ranking
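
To make the MMR idea concrete, here is a minimal sketch of the textbook greedy formulation, score = lam * relevance - (1 - lam) * max similarity to already selected items. The parameter `lam`, the function names, and the toy vectors are assumptions for illustration; the paper's own weighting scheme may differ.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr_select(
    query_vec: Sequence[float],
    cand_vecs: List[Sequence[float]],
    k: int,
    lam: float = 0.5,
) -> List[int]:
    """Greedily pick up to k candidate indices, trading relevance against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, cand_vecs[i])
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is different.
query = [1.0, 0.5]
cands = [[1.0, 0.4], [1.0, 0.3], [0.0, 1.0]]
diverse = mmr_select(query, cands, k=2, lam=0.5)
relevance_only = mmr_select(query, cands, k=2, lam=1.0)
print(diverse, relevance_only)
```

With `lam=1.0` the procedure reduces to plain relevance ranking, so the near-duplicate is kept; lowering `lam` penalizes it in favor of the more distinct candidate.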

Modeling

Base Model: GPT-4 (primary), LLaMA2-13b, LLaMA2-7b, Baize-healthcare

Compute: Inference takes a few seconds per sample; experiments were run on 4 NVIDIA A100 GPUs.

Comparison to Prior Work

vs. Almanac: KG-Rank specifically targets structured KG triples and uses MMR/Re-ranking to handle high-volume relation retrieval
vs. Standard RAG: Incorporates Answer Expansion (generating a draft to guide retrieval) specifically for triple ranking
vs. Graph-RAG [not cited in paper]: KG-Rank focuses on one-hop retrieval + heavy ranking rather than multi-hop graph traversal or reasoning paths
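
Answer Expansion, described above as generating a draft to guide retrieval, can be sketched as follows. This is a toy under stated assumptions: the draft generator is a hypothetical callable standing in for an LLM, the bag-of-words cosine stands in for a learned embedding, and concatenating the question with the draft is a choice made for the example rather than a detail confirmed by the summary.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


def bow_cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine; a stand-in for a learned embedding model."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_with_answer_expansion(
    question: str,
    triples: List[Triple],
    draft_answer: Callable[[str], str],
    top_k: int,
) -> List[Triple]:
    """Rank triples against the question expanded with a drafted answer."""
    expanded = question + " " + draft_answer(question)
    ordered = sorted(
        triples,
        key=lambda t: bow_cosine(expanded, " ".join(t)),
        reverse=True,
    )
    return ordered[:top_k]


triples = [("gamma", "near", "delta"), ("beta", "links", "epsilon")]
# Hypothetical draft: it mentions "beta", which the question itself does not.
top = rank_with_answer_expansion(
    "what relates to alpha",
    triples,
    draft_answer=lambda q: "alpha connects beta",
    top_k=1,
)
print(top)
```

The draft introduces vocabulary absent from the question, which is what lets the second triple surface in this toy case.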

Limitations

Physician evaluation was limited to two residents; a larger evaluation is planned as future work
Ranking methods increase computational latency compared to simple retrieval
Evaluation relies heavily on automated metrics (ROUGE, BERTScore) which may not fully capture medical factuality
Dependency on the coverage and quality of the underlying Knowledge Graph (UMLS/DBpedia)

Reproducibility

Code: https://github.com/YangRui525/KG-Rank

Code and data are publicly available at the repository above. Prompts are included in Appendix A. Specific hyperparameters (w=0.1, delta=0.01 for MMR) are reported.

📊 Experiments & Results

Evaluation Setup

Zero-shot Medical Long-form QA

Benchmarks:

LiveQA (Consumer health QA)
ExpertQA (Med & Bio) (High-quality long-form QA verified by experts)
MedicationQA (Drug-related consumer questions)

Metrics:

ROUGE-L
BERTScore
MoverScore
BLEURT
GPT-4 Score (Factuality)
Statistical methodology: Not explicitly reported in the paper

Key Results

KG-Rank with Re-ranking (RR) consistently improves performance over Zero-Shot (ZS) baselines across multiple datasets.

Benchmark	Metric	Baseline	This Paper	Δ
ExpertQA-Bio	ROUGE-L	0.2227	0.3015	+0.0788
ExpertQA-Med	ROUGE-L	0.2185	0.2393	+0.0208
MedicationQA	ROUGE-L	0.2312	0.2520	+0.0208
LiveQA	ROUGE-L	0.2547	0.2921	+0.0374
ExpertQA-Bio	ROUGE-L	0.2745	0.3015	+0.0270
Mintaka	Accuracy	60.40	61.90	+1.50
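
The Δ column in the table above is an absolute difference (This Paper minus Baseline). The short check below recomputes it from the listed values; the row data is copied from the table, while the helper names are hypothetical. A relative-gain helper is included for readers who prefer percentages.

```python
# Rows copied from the table: (benchmark, metric, baseline, this_paper, reported_delta)
rows = [
    ("ExpertQA-Bio", "ROUGE-L", 0.2227, 0.3015, 0.0788),
    ("ExpertQA-Med", "ROUGE-L", 0.2185, 0.2393, 0.0208),
    ("MedicationQA", "ROUGE-L", 0.2312, 0.2520, 0.0208),
    ("LiveQA", "ROUGE-L", 0.2547, 0.2921, 0.0374),
    ("ExpertQA-Bio", "ROUGE-L", 0.2745, 0.3015, 0.0270),
    ("Mintaka", "Accuracy", 60.40, 61.90, 1.50),
]


def absolute_gain(baseline: float, score: float) -> float:
    """Absolute difference, rounded to the table's four decimal places."""
    return round(score - baseline, 4)


def relative_gain_pct(baseline: float, score: float) -> float:
    """Gain expressed as a percentage of the baseline."""
    return 100.0 * (score - baseline) / baseline


# Any row whose reported delta disagrees with the recomputed one would land here.
mismatches = [r for r in rows if abs(absolute_gain(r[2], r[3]) - r[4]) > 1e-9]
print(mismatches)

for name, metric, base, score, _ in rows:
    print(f"{name} {metric}: {relative_gain_pct(base, score):.1f}% relative")
```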

Experiment Figures

A case study comparing a standard GPT-4 answer vs. KG-Rank answer regarding diet for renal/hepatic failure.

Main Takeaways

Integrating Knowledge Graphs with Ranking/Re-ranking improves factual consistency in long-form medical QA, as measured by the reported metrics.
Re-ranking (RR) is generally the most effective strategy, but Answer Expansion (AE) can be superior in datasets with noisier or less structured answers (like LiveQA).
Specialized medical re-rankers (MedCPT) outperform general commercial re-rankers (Cohere) for medical retrieval tasks.
The framework generalizes effectively to open domains (Law, History, etc.) by swapping the underlying KG (e.g., UMLS to DBpedia/Wikipedia).

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graphs (structure of entities and relations)
Retrieval-Augmented Generation (RAG) concepts
Ranking metrics (similarity, MMR)
Basic understanding of Transformer-based LLMs

Key Terms

UMLS: Unified Medical Language System—a comprehensive repository of health and biomedical vocabularies and standards

Triplets: The fundamental unit of data in a knowledge graph, consisting of (Subject, Relation, Object)

MMR: Maximal Marginal Relevance—a ranking method that balances relevance to the query with diversity (novelty) relative to already selected items

Cross-encoder: A model architecture that processes two inputs (query and document) simultaneously to output a relevance score, typically more accurate but slower than bi-encoders

ROUGE-L: A metric based on the longest common subsequence (LCS) between a generated text and a reference, rewarding words that appear in the same order in both
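
For intuition, a simplified ROUGE-L computation is sketched below. It assumes lowercase whitespace tokenization and the balanced F1 form; official scorers differ in tokenization, stemming, and weighting, so this is not a drop-in replacement for them.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 over whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

In this example five of the six tokens form a common subsequence, so precision and recall are both 5/6.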

UmlsBERT: A BERT-based clinical language model whose pre-training incorporates knowledge from the UMLS Metathesaurus; used here to embed questions and triples for ranking

MedCPT: A biomedical retrieval model trained on PubMed search data; its cross-encoder is used here for re-ranking retrieved triples