Graph Neural Network Enhanced Retrieval for Question Answering of Large Language Models

📝 Paper Summary

Graph-based RAG pipeline

GNN-Ret constructs a graph of passages based on structural and keyword connections and uses a Graph Neural Network to propagate relevance scores, improving retrieval for complex questions.

Core Problem

Existing retrieval methods treat passages in isolation based on semantic distance, failing to retrieve supporting passages that share context (like keywords or document structure) but lack direct semantic similarity to the query.

Why it matters:

Complex questions often have information asymmetry: a short inquiry focuses on one aspect while background details in the question connect to other passages
LLMs struggle to answer multi-hop questions when retrieval misses intermediate reasoning steps due to poor semantic overlap with the initial question

Concrete Example: For the question 'Where was the performer of song Left & Right born?', standard retrieval finds the performer (D'Angelo) but misses his birthplace passage because the birthplace passage doesn't mention 'Left & Right'. GNN-Ret connects them via the shared entity 'D'Angelo'.

Key Novelty

Graph Neural Network Enhanced Retrieval (GNN-Ret) and Recurrent GNN (RGNN-Ret)

Constructs a 'Graph of Passages' (GoPs) where nodes are text chunks and edges represent structural adjacency or shared keywords extracted by an LLM
Uses a GNN to update the semantic distance of a passage by aggregating minimum distances from its neighbors, allowing relevant but semantically distant passages to be retrieved via their connections
For multi-hop questions, RGNN-Ret uses a Recurrent GNN to integrate retrieval states across reasoning steps, helping subsequent steps find passages related to previous retrievals
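
The neighbor aggregation idea can be illustrated with a minimal sketch. The update rule below (blend a passage's own distance with the smallest distance found among itself and its neighbors, weighted by a hypothetical `alpha`) is an assumption made for illustration, not the paper's exact equation; the passage names and distance values are invented toy data modeled on the 'Left & Right' example.

```python
def propagate_min(distances, neighbors, alpha=0.5):
    """One round of neighbor aggregation over semantic distances.

    Assumed rule (illustrative only): each passage keeps (1 - alpha) of its
    own distance and takes alpha of the smallest distance among itself and
    its graph neighbors.
    """
    updated = {}
    for node, own in distances.items():
        candidates = [own] + [distances[n] for n in neighbors.get(node, [])]
        updated[node] = (1 - alpha) * own + alpha * min(candidates)
    return updated


# Toy data: the birthplace passage is far from the query on its own,
# but it is linked to the song passage through a shared keyword.
distances = {"song": 0.20, "birthplace": 0.80, "unrelated": 0.70}
neighbors = {"song": ["birthplace"], "birthplace": ["song"], "unrelated": []}

updated = propagate_min(distances, neighbors)
ranking = sorted(updated, key=updated.get)  # smallest distance first
print(ranking)
```

Under this assumed rule, the birthplace passage moves ahead of the unconnected passage because it inherits part of its neighbor's small distance.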

Architecture

Overview of GNN-Ret vs Dense Retrieval (Fig 1) and the workflow of RGNN-Ret (Fig 3).

Evaluation Highlights

RGNN-Ret achieves state-of-the-art accuracy on 2WikiMQA (55.8%), outperforming the strong baseline KGP by 10.6 percentage points
GNN-Ret improves accuracy by 4.0 percentage points over SBERT on the IIRC dataset using a single retrieval step
RGNN-Ret outperforms SelfAsk (a multi-step prompting baseline) by 11.0 accuracy points on 2WikiMQA

Breakthrough Assessment

8/10

Significant improvement on multi-hop QA by explicitly modeling passage relationships. The approach directly addresses the 'isolated passage' assumption of standard dense retrieval.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering where a system must retrieve supporting passages from a corpus to answer a query q

Inputs: Natural language question q (and potentially sub-questions q_t for multi-hop)

Outputs: Final answer a and a set of retrieved supporting passages

Pipeline Flow

Graph Construction: Build GoPs from corpus (offline)
Initial Retrieval: Compute semantic distances for all nodes using SBERT
GNN Propagation: Update distances using GNN (GNN-Ret) or RGNN (RGNN-Ret)
Ranking & Selection: Select top-k passages with smallest integrated distances
Generation: LLM generates answer using retrieved passages
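
The online part of the flow above (initial retrieval, propagation, ranking) can be wired together roughly as follows. This is a sketch under stated assumptions: `retrieve_top_k`, `toy_distance`, and the identity propagation are hypothetical stand-ins, and the word-overlap distance replaces the SBERT scoring only so that the example runs without external models.

```python
import heapq


def retrieve_top_k(question, passages, distance_fn, propagate_fn, k=2):
    """Hypothetical wiring of the online pipeline steps."""
    # Initial retrieval: one distance per passage.
    initial = {pid: distance_fn(question, text) for pid, text in passages.items()}
    # Propagation: placeholder for the GNN / RGNN update.
    integrated = propagate_fn(initial)
    # Ranking and selection: keep the k smallest integrated distances.
    return heapq.nsmallest(k, integrated, key=integrated.get)


def toy_distance(question, text):
    """Toy stand-in for a dense retriever: 1 - fraction of question words found."""
    q_words = set(question.lower().split())
    p_words = set(text.lower().split())
    return 1.0 - len(q_words & p_words) / len(q_words)


passages = {
    "p1": "left right performer",
    "p2": "weather report",
}
top = retrieve_top_k(
    "performer of left right",
    passages,
    distance_fn=toy_distance,
    propagate_fn=lambda d: d,  # identity: no graph information used here
    k=1,
)
print(top)
```

Passing the distance and propagation functions as arguments keeps the ranking step independent of which retriever or graph model is plugged in.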

System Modules

Graph Constructor

Builds the Graph of Passages (GoPs) by connecting adjacent passages (structural) and passages sharing keywords (keyword-based)

Model or implementation: ChatGPT (for keyword extraction)
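
A minimal sketch of the two edge types, assuming keywords have already been extracted (the LLM call is not shown). The function name `build_graph`, the passage dictionary layout, and the toy passages are hypothetical; the sketch also assumes passages are listed in document order so that consecutive entries of the same document are adjacent chunks.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(passages):
    """Return a set of undirected edges (i, j) with i < j."""
    edges = set()

    # Structural edges: consecutive chunks of the same document.
    by_doc = defaultdict(list)
    for idx, passage in enumerate(passages):
        by_doc[passage["doc"]].append(idx)
    for ids in by_doc.values():
        for a, b in zip(ids, ids[1:]):
            edges.add((a, b))

    # Keyword edges: any two passages that share a keyword.
    by_keyword = defaultdict(list)
    for idx, passage in enumerate(passages):
        for keyword in {kw.lower() for kw in passage["keywords"]}:
            by_keyword[keyword].append(idx)
    for ids in by_keyword.values():
        for a, b in combinations(ids, 2):
            edges.add((a, b))

    return edges


passages = [
    {"doc": "song_article", "keywords": ["Left & Right", "D'Angelo"]},
    {"doc": "song_article", "keywords": ["album"]},
    {"doc": "artist_article", "keywords": ["D'Angelo", "birthplace"]},
]
edges = build_graph(passages)
print(sorted(edges))
```

Here passages 0 and 1 are linked structurally, and passages 0 and 2 are linked by the shared keyword, even though they belong to different documents.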

Dense Retriever

Calculates initial semantic distances between the query and all passages

Model or implementation: SBERT (multi-qa-mpnet-base-cos-v1)
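
For reference, a semantic distance of the form 1 - cosine similarity can be computed as below. The two-dimensional vectors are toy stand-ins for SBERT embeddings, and `cosine_distance` is a hypothetical helper name rather than part of any library.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors standing in for query and passage embeddings.
query_vec = [1.0, 0.0]
passage_vecs = {"same_topic": [0.9, 0.1], "orthogonal": [0.0, 1.0]}

distances = {
    name: cosine_distance(query_vec, vec) for name, vec in passage_vecs.items()
}
print(distances)
```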

GNN / RGNN

Updates semantic distances by aggregating information from related passages in the graph

Model or implementation: 1-layer GNN (GNN-Ret) or RGNN (RGNN-Ret)

LLM Reader

Generates the final answer (and sub-questions for RGNN-Ret)

Model or implementation: ChatGPT (gpt-3.5-turbo-2023-06-01-preview)

Novel Architectural Elements

GNN-based re-ranking where node features are semantic distances to the query rather than static embeddings
Recurrent integration of retrieval states: RGNN updates passage scores by combining current step's semantic distance with the previous step's integrated distance
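
The recurrent integration can be pictured as a running blend of per-step distances. The rule below, a weighted average controlled by `beta`, is an assumption for illustration only; the role of beta in the paper's actual update may differ, and the step distances are invented toy values.

```python
def integrate_step(prev_integrated, current, beta=0.9):
    """Combine this step's distances with the previous integrated distances.

    Assumed rule (illustrative only):
        integrated_t = beta * current_t + (1 - beta) * integrated_{t-1}
    At the first step there is no previous state, so the current distances
    are used as they are.
    """
    if prev_integrated is None:
        return dict(current)
    return {
        pid: beta * current[pid] + (1 - beta) * prev_integrated[pid]
        for pid in current
    }


# Toy distances for two passages across two sub-question steps.
step_distances = [
    {"a": 0.2, "b": 0.6},  # step 1: passage "a" matches the first sub-question
    {"a": 0.9, "b": 0.5},  # step 2: passage "b" matches the follow-up better
]

state = None
for current in step_distances:
    state = integrate_step(state, current)
print(state)
```

Because the state carries over, a passage that scored well at an earlier step keeps a small advantage at later steps instead of being scored from scratch.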

Modeling

Base Model: SBERT (multi-qa-mpnet-base-cos-v1) for embeddings; ChatGPT for generation

Training Method: Gradient descent on GNN parameters (alpha, beta)

Objective Functions:

Purpose: Ensure supporting passages have lower integrated semantic distances than non-supporting passages.

Formally: max(0, r + d_target - d_non_target), where d represents average integrated semantic distance.
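
The objective above translates directly into a few lines of code. In this sketch the function name `hinge_loss` and the example distances are illustrative; the default margin matches the reported margin_r of 0.01.

```python
def hinge_loss(target_distances, non_target_distances, margin=0.01):
    """max(0, r + d_target - d_non_target) with d taken as the average distance."""
    d_target = sum(target_distances) / len(target_distances)
    d_non_target = sum(non_target_distances) / len(non_target_distances)
    return max(0.0, margin + d_target - d_non_target)


# Supporting passages already closer than non-supporting ones: no penalty.
good = hinge_loss([0.3, 0.5], [0.6, 0.8])

# Supporting passages farther away: the loss grows with the gap.
bad = hinge_loss([0.6, 0.8], [0.3, 0.5])
print(good, bad)
```

The loss is zero only when supporting passages are closer to the query than non-supporting ones by at least the margin, which is what pushes the GNN parameters toward ranking supporting passages first.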

Training Data:

500 questions sampled from development sets of MuSiQue, IIRC, 2WikiMQA
20 questions used for training GNN parameters, rest for testing

Key Hyperparameters:

learning_rate: 1.0
GNN_layers: 1
alpha_1: 0.5 (initial)
beta: 0.9 (initial)
margin_r: 0.01
K_neighbors: 5
O_competitive_set: 10 or 25

Compute: Not reported in the paper

Comparison to Prior Work

vs. KGP: GNN-Ret propagates dense semantic scores through the graph rather than just retrieving neighbors of seed matches
vs. SelfAsk/IRCoT: GNN-Ret explicitly uses passage structure to find relevant documents that lack semantic overlap with the query, rather than relying solely on generated sub-questions
vs. Graph-CoT [not cited in paper]: GNN-Ret focuses on retrieval enhancement via GNNs rather than traversing the graph for reasoning paths

Limitations

Relies on multiple LLM queries for keyword extraction during offline graph construction, which can be costly
Assumes an undirected graph without edge types, potentially losing nuance in relationships
Evaluation uses a closed-source LLM (ChatGPT), which may change over time
Training and validation done on a relatively small subset of questions (500 sampled)

Reproducibility

Code: https://github.com/zli999/GNN_Ret

The paper relies on closed-source ChatGPT for both generation and keyword extraction, so exact results may be hard to reproduce if the model changes. The GNN parameters are trained on only 20 questions, which keeps the training step itself cheap to rerun.

📊 Experiments & Results

Evaluation Setup

Open-domain QA on multi-hop and long-context datasets

Benchmarks:

MuSiQue (Multi-hop reasoning QA)
IIRC (Incomplete Information Reading Comprehension)
2WikiMQA (Multi-hop reasoning QA)
QuALITY (Long-context QA)

Metrics:

F1 score
Exact Match (EM)
Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of single-hop retrieval performance (GNN-Ret) against dense retrieval baselines.
2WikiMQA	Accuracy	42.3	46.9	+4.6
IIRC	Accuracy	40.4	44.4	+4.0
Comparison of multi-hop retrieval performance (RGNN-Ret) against multi-step baselines.
2WikiMQA	Accuracy	45.2	55.8	+10.6
MuSiQue	Accuracy	27.5	31.3	+3.8
Ablation study showing the impact of graph components (Structural Information vs Shared Keywords).
2WikiMQA	Accuracy	42.3	46.9	+4.6
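
The Δ column reports absolute differences in accuracy points, which can be checked against the rows above. The short sketch below recomputes them and, as an illustration of the distinction, also derives the relative gain; the helper names are hypothetical and the numbers are copied from the table.

```python
# (benchmark, baseline accuracy, this paper's accuracy, reported delta)
rows = [
    ("2WikiMQA", 42.3, 46.9, 4.6),   # single-hop comparison
    ("IIRC", 40.4, 44.4, 4.0),       # single-hop comparison
    ("2WikiMQA", 45.2, 55.8, 10.6),  # multi-hop comparison
    ("MuSiQue", 27.5, 31.3, 3.8),    # multi-hop comparison
]


def absolute_gain(baseline, ours):
    """Difference in accuracy points, rounded to one decimal."""
    return round(ours - baseline, 1)


def relative_gain_percent(baseline, ours):
    """Improvement relative to the baseline, in percent."""
    return round(100 * (ours - baseline) / baseline, 1)


checks = [absolute_gain(b, o) == delta for _, b, o, delta in rows]
for name, baseline, ours, _ in rows:
    print(name, absolute_gain(baseline, ours), relative_gain_percent(baseline, ours))
```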

Experiment Figures

Accuracy vs Average Number of LLM Queries on 2WikiMQA.

Main Takeaways

GNN-Ret significantly improves retrieval accuracy over dense baselines (SBERT) by leveraging passage relatedness, particularly for questions with information asymmetry.
RGNN-Ret achieves state-of-the-art results on 2WikiMQA, demonstrating the value of recurrently updating retrieval states across reasoning steps.
Both structural adjacency and shared keywords contribute to performance, with their combination yielding the best results.
The method is robust across different LLM backbones (ChatGPT, Qwen, Gemma) and maintains advantages even with long-context models.

📚 Prerequisite Knowledge

Prerequisites

Graph Neural Networks (GNNs)
Dense Passage Retrieval (DPR) / Semantic Search
Multi-hop Question Answering

Key Terms

GoPs: Graph of Passages—a graph where nodes are passages and edges represent structural or keyword relationships

GNN: Graph Neural Network—a neural network that processes data represented as graphs

RGNN: Recurrent Graph Neural Network—a GNN variant that maintains state across time steps, used here for multi-hop reasoning steps

SBERT: Sentence-BERT—a modification of the BERT network to derive semantically meaningful sentence embeddings

Self-critique: A mechanism where the LLM evaluates its own intermediate answers to decide whether to continue reasoning or output a final answer

Hinge objective: A loss function used to train the GNN that encourages the distance of positive examples (supporting passages) to be lower (better) than that of negative examples by at least a margin

KGP: Knowledge Graph Prompting—a baseline method that retrieves neighbors of seed nodes in a Knowledge Graph

Semantic distance: A metric (typically 1 - cosine similarity) measuring how dissimilar two text embeddings are