G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation

📝 Paper Summary

Explainable Recommendation · Graph Retrieval-Augmented Generation (GraphRAG)

G-Refer combines path-level and node-level graph retrieval with a knowledge pruning mechanism to help Large Language Models generate explicit, accurate, and stable explanations for recommendations.

Core Problem

Existing methods struggle to extract collaborative filtering (CF) information from complex graphs and effectively integrate these implicit, structured signals into LLMs for textual explanations.

Why it matters:

LLMs alone lack the specific collaborative context to explain *why* a user likes an item
Implicit GNN embeddings are opaque and hard to interpret, making it difficult to verify the logic behind a recommendation
There is a modality gap between structured graph data (nodes/edges) and the natural language generation required for user-facing explanations

Concrete Example: A GNN might predict that a user likes 'Doctor Strange' purely from an embedding dot product. That score never tells the LLM that the user previously watched 'Iron Man' (a path connection) or likes the actor Benedict Cumberbatch (a semantic connection), so the LLM tends to hallucinate generic reasons instead of citing specific evidence.

Key Novelty

Hybrid Graph Retrieval with Knowledge Pruning

Retrieves collaborative filtering signals from two perspectives: 'Path-level' (structural connections like User->Item->User->Item) and 'Node-level' (semantic similarities based on text profiles)
Translates these retrieved graph components into flattened natural language text to prompt the LLM
Filters out 'easy' training samples where the profile alone is sufficient (Knowledge Pruning), forcing the model to learn from cases where graph knowledge is actually necessary

Architecture

The overall G-Refer pipeline, illustrating the flow from user-item input to explanation generation via hybrid retrieval.

Evaluation Highlights

+8.67% improvement in BERT-Recall on the Yelp dataset compared to the strongest baseline (XRec)
+7.48% improvement in BERT-Recall on Google-reviews compared to XRec, indicating better coverage of key explanation information
Achieves higher stability (lower standard deviation) across GPT, BERT, and BART metrics compared to baselines like PEPLER and PETER

Breakthrough Assessment

7/10

Strong empirical results and a well-motivated architecture combining structural and semantic retrieval. The knowledge pruning idea for RAG training is a practical insight.

⚙️ Technical Details

Problem Definition

Setting: Explainable Recommendation on Bipartite Graphs

Inputs: User u, Recommended Item i, User-Item Graph G, User Profile b_u, Item Profile c_i

Outputs: Natural language explanation text explaining why u would like i

Pipeline Flow

Path-level Retriever (finds structural paths)
Node-level Retriever (finds semantically similar nodes)
Graph Translation (flattens structure to text)
LLM Generation (fine-tuned with LoRA)

System Modules

Path-level Retriever (Retrieval & Selection)

Extract structural CF information by finding paths connecting the user and item

Model or implementation: R-GCN (Encoder) + Mask Learning + Dijkstra
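
The summary names Dijkstra over learned edge masks but does not give the edge cost. The following is a minimal sketch, assuming each edge carries a learned mask value in (0, 1] and that the cost is -log(mask), so the shortest path is the one with the largest product of mask values. The function name, the toy graph, and the mask values are hypothetical.

```python
import heapq
import math

def most_important_path(edges, source, target):
    """Dijkstra over costs -log(mask); returns the path maximizing the mask product.

    edges: dict mapping (node_a, node_b) -> mask value in (0, 1], treated as undirected.
    """
    adj = {}
    for (a, b), mask in edges.items():
        cost = -math.log(mask)  # assumption: a high mask value means a cheap edge
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))

    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nxt, cost in adj.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))

    if target not in dist:
        return None  # user and item are not connected
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical toy graph with made-up mask values.
toy_edges = {
    ("u1", "Iron Man"): 0.9,
    ("Iron Man", "u2"): 0.8,
    ("u2", "Doctor Strange"): 0.9,
    ("u1", "Titanic"): 0.2,
    ("Titanic", "u3"): 0.3,
    ("u3", "Doctor Strange"): 0.4,
}
path = most_important_path(toy_edges, "u1", "Doctor Strange")
print(path)  # ['u1', 'Iron Man', 'u2', 'Doctor Strange']
```

Retrieving k paths (the summary lists k = 2) would need a k-shortest-paths variant, which this sketch does not cover.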

Node-level Retriever (Retrieval & Selection)

Extract semantic CF information by finding similar users/items based on text profiles

Model or implementation: SentenceBERT (Dual Encoder)
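
Node-level retrieval amounts to a nearest-neighbor search over profile embeddings. The sketch below assumes the embeddings have already been computed and uses tiny made-up vectors in place of real SentenceBERT outputs. `top_k_similar` and the item names are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query_id, embeddings, k=2):
    """Return the k node ids whose embeddings are closest to the query node's."""
    query_vec = embeddings[query_id]
    scored = [
        (cosine(query_vec, vec), node_id)
        for node_id, vec in embeddings.items()
        if node_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [node_id for _, node_id in scored[:k]]

# Made-up 3-d vectors standing in for item-profile embeddings.
item_embeddings = {
    "Doctor Strange": [0.9, 0.1, 0.0],
    "Iron Man": [0.8, 0.2, 0.1],
    "Thor": [0.7, 0.3, 0.0],
    "Titanic": [0.0, 0.1, 0.9],
}
neighbors = top_k_similar("Doctor Strange", item_embeddings, k=2)
print(neighbors)  # ['Iron Man', 'Thor']
```

The same function would apply to user profiles by passing a dictionary of user embeddings instead.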

Graph Translation

Convert retrieved paths and nodes into a natural language prompt

Model or implementation: Rule-based Flattening
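
Rule-based flattening can be pictured as filling sentence templates from the retrieved structure. The templates and function names below are invented for illustration; the actual prompt wording is given in the paper and is not reproduced here.

```python
def flatten_path(path, node_types):
    """Turn an alternating user/item path into one sentence (illustrative template)."""
    clauses = []
    for a, b in zip(path, path[1:]):
        if node_types[a] == "user":
            clauses.append(f"user {a} interacted with item '{b}'")
        else:
            clauses.append(f"item '{a}' was also chosen by user {b}")
    return "; ".join(clauses) + "."

def build_prompt(user, item, paths, similar_items, node_types):
    """Assemble a plain-text prompt from retrieved paths and similar nodes."""
    lines = [f"Explain why user {user} would like '{item}'.", "Related paths:"]
    lines += [f"- {flatten_path(p, node_types)}" for p in paths]
    lines.append("Similar items: " + ", ".join(similar_items))
    return "\n".join(lines)

# Hypothetical retrieved context.
node_types = {"u1": "user", "u2": "user", "Iron Man": "item", "Doctor Strange": "item"}
example_path = ["u1", "Iron Man", "u2", "Doctor Strange"]
sentence = flatten_path(example_path, node_types)
prompt = build_prompt("u1", "Doctor Strange", [example_path], ["Iron Man", "Thor"], node_types)
print(prompt)
```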

LLM Generator

Generate the final explanation

Model or implementation: Llama-2-7B or Llama-3-8B (with LoRA)

Novel Architectural Elements

Hybrid retrieval combining learned structural paths (via mask optimization) and dense semantic retrieval
Knowledge pruning pipeline component that filters training data based on profile-explanation similarity
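
The pruning idea can be sketched as: score each training sample by how similar its ground-truth explanation is to the input profile, then drop the most similar ("easy") ones. The sketch below uses token-level Jaccard overlap as a cheap stand-in for a semantic similarity model, which is an assumption, and `drop_fraction` is a hypothetical parameter whose mapping to the paper's ratio t is not asserted here.

```python
def jaccard(text_a, text_b):
    """Token-set overlap, used here only as a stand-in for semantic similarity."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def prune_easy_samples(samples, drop_fraction):
    """Drop the samples whose explanation is most similar to the profile."""
    scored = sorted(samples, key=lambda s: jaccard(s["profile"], s["explanation"]))
    keep = len(samples) - int(len(samples) * drop_fraction)
    return scored[:keep]  # least similar ("hard") samples survive

# Made-up training samples.
samples = [
    {"id": "A", "profile": "loves quiet cafes with good coffee",
     "explanation": "the user loves quiet cafes with good coffee"},
    {"id": "B", "profile": "enjoys spicy food",
     "explanation": "friends with similar taste rated this ramen shop highly"},
    {"id": "C", "profile": "prefers superhero movies",
     "explanation": "the user prefers superhero movies and action"},
    {"id": "D", "profile": "likes long fantasy novels",
     "explanation": "readers who shared two earlier purchases also bought this book"},
]
kept_ids = [s["id"] for s in prune_easy_samples(samples, drop_fraction=0.5)]
print(kept_ids)  # ['B', 'D']
```

Samples A and C are dropped because their explanations largely restate the profile, so graph knowledge adds little to them.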

Modeling

Base Model: Llama-2-7B and Llama-3-8B

Training Method: Retrieval-Augmented Fine-Tuning (RAFT) with LoRA

Objective Functions:

Purpose: Train the LLM to generate explanations given the profile and retrieved context.

Formally: Standard language modeling loss on the pruned dataset D_prune: L(theta) = -sum_{(u,i) in D_prune} log P(Explain(u,i) | b_u, c_i, K(u,i), Q; theta)

Purpose: Learn edge masks for path retrieval (prior to LLM training).

Formally: L(M) = L_pred(M) + L_path(M), combining a prediction loss (importance for recommendation) and a path loss (conciseness)

Adaptation: LoRA (rank=8)

Trainable Parameters: LoRA adapters (approx. 19.9M parameters vs 4.2M for XRec)
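
As a sanity check on the adapter size, each LoRA-adapted weight of shape (d_in, d_out) adds rank × (d_in + d_out) trainable parameters. The sketch below uses the published Llama-2-7B dimensions and assumes adapters on all seven linear projections of every decoder layer. The summary does not say which modules are adapted, so that choice is a guess.

```python
def lora_param_count(shapes, rank, num_layers):
    """Each adapted (d_in, d_out) weight adds rank * (d_in + d_out) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers

# Llama-2-7B: hidden size 4096, MLP size 11008, 32 decoder layers.
hidden, mlp, layers = 4096, 11008, 32

# Assumption: q, k, v, o, gate, up and down projections all get adapters.
shapes = [(hidden, hidden)] * 4 + [(hidden, mlp)] * 2 + [(mlp, hidden)]

total = lora_param_count(shapes, rank=8, num_layers=layers)
print(f"{total:,}")  # 19,988,480
```

Under that assumption the count lands close to the reported 19.9M figure.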

Training Data:

Amazon-books: 95,841 train / 3,000 test
Yelp: 74,212 train / 3,000 test
Google-reviews: 94,663 train / 3,000 test
Training data is filtered via Knowledge Pruning (ratio t=70% typically)

Key Hyperparameters:

learning_rate: 2e-5
epochs: 2
batch_size: 32 (7B) / 16 (8B)
lora_rank: 8
retrieved_paths_k: 2
retrieved_nodes_k: 2
pruning_ratio_t: 70%

Compute: 8 NVIDIA A100 GPUs

Comparison to Prior Work

vs. XRec: G-Refer uses explicit text-based graph retrieval (paths/nodes) instead of implicit embeddings, enabling better interpretability and control.
vs. PEPLER/PETER: G-Refer incorporates external graph structure knowledge, whereas PEPLER/PETER rely solely on internal model knowledge and user/item IDs.
vs. GraphRAG [cited]: G-Refer tailors the retrieval specifically for recommendation (user-item paths) and adds knowledge pruning.

Limitations

Performance gain on sparse graphs (like Amazon-books) is limited compared to denser graphs.
Requires fine-tuning an LLM adapter (LoRA), which is more computationally expensive than zero-shot prompting.
The retrieval process introduces a trade-off: higher recall/completeness but slightly lower precision in generated text.

Reproducibility

Code: https://anonymous.4open.science/r/G-Refer

Code and data available at https://anonymous.4open.science/r/G-Refer. Uses public datasets (Amazon, Yelp, Google). Specific prompt templates are provided in the paper. Pre-trained weights for SentenceBERT and Llama are open source.

📊 Experiments & Results

Evaluation Setup

Generate explanations for held-out user-item pairs in the test set.

Benchmarks:

Amazon-books (Explainable Recommendation)
Yelp (Explainable Recommendation)
Google-reviews (Explainable Recommendation)

Metrics:

BLEU
ROUGE
GPT Score
BERT Score (Precision, Recall, F1)
BART Score
BLEURT
USR

Statistical methodology: Standard deviation is reported for stability analysis. No significance tests (e.g., t-tests) are explicitly reported.

Key Results

Comparative analysis against baselines showing improvements in semantic coverage (Recall) and overall quality (F1/GPT Score).
Benchmark	Metric	Baseline	This Paper	Δ
Yelp	BERT-Recall	0.3506	0.4373	+0.0867
Yelp	BERT-F1	0.3730	0.4003	+0.0273
Google-reviews	BERT-Recall	0.4069	0.4935	+0.0866
Amazon-books	BERT-F1	0.4122	0.4289	+0.0167
Yelp	BERT-F1	0.4002	0.4003	+0.0001
Yelp	BERT-F1	0.3927	0.4003	+0.0076
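
The Δ column is an absolute difference between scores, which is not the same as a relative gain. A short worked example, using two rows from the table, shows both readings side by side. The helper name is illustrative.

```python
def improvement(baseline, ours):
    """Return (absolute difference, relative gain in percent)."""
    absolute = ours - baseline
    relative = absolute / baseline
    return round(absolute, 4), round(relative * 100, 1)

yelp_recall = improvement(0.3506, 0.4373)
amazon_f1 = improvement(0.4122, 0.4289)
print(yelp_recall)  # (0.0867, 24.7)
print(amazon_f1)    # (0.0167, 4.1)
```

An absolute gain of 0.0867 on Yelp BERT-Recall therefore corresponds to roughly a 24.7% relative gain over the baseline score.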

Experiment Figures

Effect of the number of retrieved items (k) on BERT-Precision and BERT-Recall.

Efficiency analysis comparing Training Time vs F1-score for G-Refer, XRec, and Full-set training.

Main Takeaways

Explicit graph retrieval (G-Refer) substantially outperforms implicit embedding-based methods (XRec) in recall, meaning explanations cover more of the relevant ground-truth details.
Knowledge pruning allows the model to train on much less data (e.g., retaining only 70%) while maintaining or even slightly improving performance by focusing on 'hard' examples requiring external knowledge.
Both path-level (structural) and node-level (semantic) retrieval are necessary; ablating either drops performance, with semantic retrieval being more critical on Yelp and structural on Google-reviews.
Human evaluation confirms a strong preference for G-Refer explanations (chosen >80% of the time on Yelp/Google) over XRec.

📚 Prerequisite Knowledge

Prerequisites

Graph Neural Networks (GNNs) for recommendation
Retrieval-Augmented Generation (RAG)
Collaborative Filtering (CF)
LoRA (Low-Rank Adaptation)

Key Terms

CF information: Collaborative Filtering information—patterns in user behavior (who bought what) used to predict preferences

R-GCN: Relational Graph Convolutional Network—a GNN variant used here to encode user-item interactions

m-core pruning: A graph preprocessing step that iteratively removes nodes with a degree less than m to reduce noise
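
A minimal sketch of m-core pruning, assuming the graph is given as a symmetric adjacency dictionary. The function name and the toy user-item graph are made up for illustration.

```python
def m_core(adjacency, m):
    """Iteratively remove nodes with degree < m; returns the surviving subgraph.

    adjacency: dict node -> iterable of neighbors, assumed symmetric.
    """
    graph = {node: set(nbrs) for node, nbrs in adjacency.items()}
    changed = True
    while changed:
        changed = False
        low_degree = [n for n, nbrs in graph.items() if len(nbrs) < m]
        for node in low_degree:
            for nbr in graph.pop(node):
                graph[nbr].discard(node)  # removal may lower a neighbor's degree
            changed = True
    return graph

# Toy bipartite graph: users u1..u3, items i1..i3.
toy_graph = {
    "u1": ["i1", "i2"],
    "u2": ["i1", "i2"],
    "u3": ["i2", "i3"],
    "i1": ["u1", "u2"],
    "i2": ["u1", "u2", "u3"],
    "i3": ["u3"],
}
core = m_core(toy_graph, m=2)
print(sorted(core))  # ['i1', 'i2', 'u1', 'u2']
```

Removing i3 (degree 1) drops u3 to degree 1 as well, so u3 is removed on the next pass. This cascade is why the procedure has to iterate.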

RAFT: Retrieval-Augmented Fine-Tuning—fine-tuning the LLM specifically to utilize retrieved documents/context

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

Dijkstra's algorithm: A shortest-path algorithm used here to find the most relevant explanation paths in the graph

BERTScore: An evaluation metric that computes similarity between candidate and reference sentences using BERT embeddings (Precision, Recall, F1)
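
The precision/recall split can be illustrated with greedy token matching: each candidate token is matched to its most similar reference token (precision), and each reference token to its most similar candidate token (recall). The sketch below uses made-up orthogonal vectors in place of real BERT token embeddings and omits optional refinements such as IDF weighting.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def greedy_match_score(cand_vecs, ref_vecs):
    """Precision, recall and F1 from greedy best-match cosine similarities."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy token embeddings: the candidate covers two of the three reference tokens.
candidate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
precision, recall, f1 = greedy_match_score(candidate, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 1.0 0.667 0.8
```

Every candidate token finds a perfect match, so precision is 1.0, while the uncovered reference token pulls recall down to 2/3. This is the sense in which higher recall indicates better coverage of the reference explanation.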

Knowledge Pruning: A proposed filtering strategy to remove training samples where the ground truth explanation is semantically similar to the input profiles, focusing training on harder cases