Faithful Path Language Modelling for Explainable Recommendation over Knowledge Graph

📝 Paper Summary

Explainable Recommendation, Knowledge Graph Reasoning, Path Language Modeling

PEARLM eliminates hallucinated explanations in recommender systems by enforcing knowledge graph constraints during decoding and learning token embeddings directly from paths rather than pre-trained embeddings.

Core Problem

Existing path-based language models for recommendation often generate 'hallucinated' paths—sequences of entities and relations that do not actually exist in the Knowledge Graph (KG)—leading to unfaithful explanations.

Why it matters:

Hallucinated paths (e.g., inventing a 'starred in' relation between a user and a movie) erode user trust and fail GDPR 'right to explanation' requirements.
Current models rely on pre-trained KG embeddings that are optimized for link prediction, not path generation, limiting recommendation accuracy.
Users struggle to detect subtle inaccuracies in explanations, meaning systems must guarantee structural faithfulness by design.

Concrete Example: A model might explain a recommendation for 'Interstellar' by claiming the user watched 'Movie A', which starred 'Johnny Depp', who starred in 'Interstellar', even though the KG contains no edge linking Johnny Depp to Interstellar. Existing models like PLM produce paths containing such non-existent triplets in roughly 90-94% of cases by hop 3 (a Path Faithfulness Rate of only ~6-10%).

Key Novelty

Path-based Explainable-Accurate Recommender based on Language Modelling (PEARLM)

Treats KG paths as sentences and trains a causal language model (Transformer) to generate them, predicting the next entity/relation token.
Introduces Knowledge Graph Constraint Decoding (KGCD) to force the model to only select tokens that are valid neighbors in the KG, guaranteeing zero hallucinations.
Learns embeddings directly from path tokens from scratch (Direct Embedding Learning) rather than initializing with pre-trained KG embeddings (like TransE), capturing richer path-centric semantics.

Architecture

The PEARLM framework pipeline: (a) Path Extraction from KG, (b) Pretraining Causal Language Model, (c) KG-Constrained Decoding.

Evaluation Highlights

+42% to +78% improvement in NDCG over best baselines (KGAT, CKE) on MovieLens1M and LastFM datasets.
Achieves 100% Path Faithfulness Rate (PFR), completely eliminating corrupted paths, whereas the PLM baseline drops to ~6-10% faithfulness at hop 3.
Outperforms state-of-the-art in Coverage by up to 73% on LastFM while maintaining high Serendipity and Novelty.

Breakthrough Assessment

8/10

Significantly advances explainable recommendation by eliminating hallucinated (structurally unfaithful) paths while also delivering large gains in recommendation utility.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive path generation over a Knowledge Graph G = {(eh, r, et)}.

Inputs: A user entity u and a starting relation rf (interaction).

Outputs: A sequence of tokens (entities and relations) forming a valid path ending in a recommended item p, serving as both recommendation and explanation.

Pipeline Flow

Path Sampling (Random walks to create training sequences)
Tokenizer & Embedding (Entities/Relations → Tokens)
Transformer Decoder (Causal Language Model)
KGCD (Constrained Beam Search for inference)

System Modules

Path Sampler

Extracts user-centric paths from the KG to serve as training data (sentences).

Model or implementation: Random Walk
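A minimal sketch of how random-walk path sampling could look, assuming the KG is held as adjacency lists. The names `sample_path` and `toy_kg` and the toy entities are hypothetical illustrations, not taken from the paper's code, and the paper's sampler may add filtering that is not shown here.

```python
import random

def sample_path(kg, user, hops, rng):
    """Random walk of `hops` steps from `user`.

    Returns alternating entity/relation tokens, or None if the walk
    reaches an entity with no outgoing edges.
    """
    path = [user]
    current = user
    for _ in range(hops):
        edges = kg.get(current, [])
        if not edges:
            return None  # dead end: discard this walk
        relation, neighbor = rng.choice(edges)
        path += [relation, neighbor]
        current = neighbor
    return path

# Toy KG (hypothetical): entity -> list of (relation, neighbor).
toy_kg = {
    "U1": [("watched", "M1")],
    "M1": [("starred_by", "A1"), ("watched_by", "U1")],
    "A1": [("starred_in", "M1"), ("starred_in", "M2")],
    "M2": [("starred_by", "A1")],
}

rng = random.Random(0)  # seeded so the walk is reproducible
path = sample_path(toy_kg, "U1", hops=3, rng=rng)
print(path)
```

Each sampled path then serves as one training "sentence" for the language model.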

PEARLM Decoder

Learns to predict the next token (entity or relation) in a path sequence.

Model or implementation: GPT-2 architecture (Distil, Medium, or Large variants)

KGCD Controller

Modifies logits during beam search to enforce KG validity.

Model or implementation: Rule-based Logit Masking
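The masking rule can be illustrated with a small sketch in plain Python. It assumes paths alternate entity and relation tokens and that the KG is available as adjacency lists. The helper names (`valid_next_tokens`, `mask_logits`) and the toy vocabulary are hypothetical, and a real implementation would apply the mask to each beam inside the beam-search loop.

```python
import math

def valid_next_tokens(kg, path):
    """Tokens that keep `path` inside the KG (path = entity, relation, entity, ...)."""
    if len(path) % 2 == 1:                 # ends on an entity -> next is a relation
        head = path[-1]
        return {rel for rel, _ in kg.get(head, [])}
    head, rel = path[-2], path[-1]         # ends on a relation -> next is an entity
    return {tail for r, tail in kg.get(head, []) if r == rel}

def mask_logits(logits, vocab, allowed):
    """Set every disallowed token's logit to -inf (probability 0 after softmax)."""
    return [x if tok in allowed else -math.inf for tok, x in zip(vocab, logits)]

def softmax(logits):
    # Assumes at least one finite logit, i.e. at least one allowed token.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy KG and vocabulary (hypothetical).
toy_kg = {
    "U1": [("watched", "M1")],
    "M1": [("starred_by", "A1")],
    "A1": [("starred_in", "M1"), ("starred_in", "M2")],
}
vocab = ["U1", "M1", "M2", "A1", "watched", "starred_by", "starred_in"]

path = ["U1", "watched", "M1", "starred_by", "A1", "starred_in"]
raw_logits = [2.0, 0.5, 1.0, 3.0, 0.1, 0.2, 0.3]   # unconstrained argmax is "A1"

allowed = valid_next_tokens(toy_kg, path)
probs = softmax(mask_logits(raw_logits, vocab, allowed))
best = vocab[probs.index(max(probs))]
print(allowed, best)
```

Without the mask the highest-scoring token would be `A1`, which does not complete any triplet in the toy KG. With the mask only `M1` and `M2` keep non-zero probability.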

Novel Architectural Elements

Unified modeling of entities and relations as a single vocabulary stream within a standard Transformer Decoder (unlike PLM which used separate heads/parameters for entities and relations)
Integration of hard KG constraints directly into the decoding loop (KGCD) to structurally guarantee path validity

Modeling

Base Model: GPT-2 (DistilGPT2, GPT2-Medium, GPT2-Large)

Training Method: Causal Language Modeling (Next Token Prediction)

Objective Functions:

Purpose: Maximize probability of the correct next token in the path.

Formally: Standard Cross-Entropy Loss over the vocabulary V.
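For reference, the standard next-token cross-entropy for one tokenized path can be written as follows. The notation is an assumption chosen for illustration (x_t is the t-th path token, L the context length, θ the model parameters, z_θ the output logits), and the paper's own symbols may differ.

```latex
\mathcal{L}(\theta)
  = -\sum_{t=1}^{L-1} \log p_\theta\!\left(x_{t+1} \mid x_1, \dots, x_t\right),
\qquad
p_\theta\!\left(v \mid x_{\le t}\right)
  = \frac{\exp\!\big(z_\theta(x_{\le t})_v\big)}
         {\sum_{v' \in V} \exp\!\big(z_\theta(x_{\le t})_{v'}\big)}
```

Here the softmax runs over the single shared vocabulary V of entity and relation tokens.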

Adaptation: Full training of embeddings and transformer weights from scratch

Trainable Parameters: All weights (Embeddings + Transformer Layers)

Training Data:

User-centric paths sampled via random walks from MovieLens1M and LastFM KGs
Max hops: 3 to 5
Context length L = 2*N + 1, where N is the number of hops (an N-hop path has N + 1 entities and N relations)
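A short sketch of where the 2*N + 1 figure comes from, reading N as the number of hops. It flattens a chain of (head, relation, tail) hops into one token sequence. The function name `flatten_hops` and the example entities are hypothetical.

```python
def flatten_hops(hops):
    """Turn chained (head, relation, tail) hops into one token sequence."""
    tokens = [hops[0][0]]
    for head, relation, tail in hops:
        if head != tokens[-1]:
            raise ValueError("hops do not chain at %r" % (head,))
        tokens += [relation, tail]
    return tokens

hops = [
    ("U1", "watched", "M1"),
    ("M1", "starred_by", "A1"),
    ("A1", "starred_in", "M2"),
]
tokens = flatten_hops(hops)
n_hops = len(hops)
print(len(tokens), tokens)
```

With 3 hops the sequence holds 4 entities and 3 relations, so 7 tokens in total.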

Key Hyperparameters:

beam_size: 30
beam_groups: 5
diversity_penalty: 0.3
sequences_generated: 100
training_iterations: Fixed budget (emulating realistic constraints)

Compute: Not reported in the paper

Comparison to Prior Work

vs. PLM: PEARLM uses direct embedding learning (vs. fixed KGE), unified architecture (vs. separate heads), and KGCD (vs. unconstrained decoding).
vs. KGAT/CKE: PEARLM provides explanations via paths (KGAT is accurate but opaque/embedding-based).
vs. PGPR/CAFE: PEARLM uses a generative language modeling approach rather than reinforcement learning agents.

Limitations

KGCD effectiveness depends on the connectivity of the underlying KG (requires users to be connected to enough items within k hops).
Performance drops with longer path lengths (5 hops vs 3 hops) potentially due to noise/complexity.
Requires explicit path sampling preprocessing step.

Reproducibility

Code: https://tinyurl.com/pearlmrecsys

Source code and data are publicly available at https://tinyurl.com/pearlmrecsys. The paper specifies datasets (ML1M, LFM1M) and preprocessing steps (removing items not in KG, k-core filtering). Hyperparameters for beam search are explicitly listed.

📊 Experiments & Results

Evaluation Setup

Top-N Recommendation (N=10) with explanation path generation.

Benchmarks:

MovieLens1M (ML1M) (Movie Recommendation)
LastFM (LFM1M) (Music Recommendation)

Metrics:

NDCG@10
MRR@10
Precision@10
Recall@10
Path Faithfulness Rate (PFR@k)
Serendipity
Diversity
Novelty
Coverage
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens1M	NDCG@10	0.31	0.44	+0.13
LastFM	NDCG@10	0.33	0.58	+0.25
MovieLens1M	PFR@3 (Path Faithfulness Rate)	0.06	1.0	+0.94
LastFM	PFR@3	0.10	1.0	+0.90
MovieLens1M	Coverage	0.64	0.80	+0.16
MovieLens1M	NDCG@10	0.29	0.44	+0.15
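The Δ column above reports absolute differences, while the headline figures elsewhere in this summary are relative improvements. The sketch below converts the NDCG@10 rows of the table into relative gains. Because it uses the rounded table values, the results can differ by a point or two from percentages computed on unrounded scores. The helper `relative_gain` is a hypothetical name.

```python
def relative_gain(baseline, ours):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

# (benchmark, baseline NDCG@10, PEARLM NDCG@10), copied from the table above.
rows = [
    ("MovieLens1M", 0.31, 0.44),
    ("LastFM", 0.33, 0.58),
    ("MovieLens1M", 0.29, 0.44),
]
gains = [round(relative_gain(b, o), 1) for _, b, o in rows]
for (name, b, o), g in zip(rows, gains):
    print(f"{name}: {b} -> {o} (+{g}%)")
```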

Experiment Figures

Conceptual illustration of path reasoning vs. template-based explanations, showing how hallucinations occur.

Main Takeaways

PEARLM achieves 100% path faithfulness, completely eliminating the 'hallucination' problem prevalent in prior path language models (PLM retains only ~6-10% faithfulness at hop 3).
Directly learning embeddings for the path generation task yields massive utility gains (+40-70% NDCG) compared to initializing with pre-trained KGEs (TransE).
KG Constraint Decoding (KGCD) ensures structural validity without sacrificing recommendation accuracy; in fact, valid paths often correlate with better recommendations.
PEARLM successfully balances accuracy with beyond-accuracy goals (Serendipity, Diversity, Coverage), often outperforming both RL-based and GNN-based baselines.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graphs (KG) structure (entities, relations, triplets)
Causal Language Modeling (CLM) / Autoregressive generation
Transformer architecture (specifically Decoders)
Beam Search decoding

Key Terms

PFR@k: Path Faithfulness Rate at hop k—the percentage of generated paths that contain NO corrupted/non-existent triplets up to the k-th hop.
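One way to compute this metric, sketched under the assumption that each generated path is a flat list of alternating entity and relation tokens and that the KG is given as a set of (head, relation, tail) triplets. The function names and toy data are hypothetical.

```python
def is_faithful(path, kg_triples, k):
    """True if the first k hops of `path` are all triplets present in the KG."""
    hops = min(k, (len(path) - 1) // 2)
    for i in range(hops):
        triple = (path[2 * i], path[2 * i + 1], path[2 * i + 2])
        if triple not in kg_triples:
            return False
    return True

def pfr_at_k(paths, kg_triples, k):
    """Fraction of paths with no non-existent triplet up to hop k (paths must be non-empty)."""
    return sum(is_faithful(p, kg_triples, k) for p in paths) / len(paths)

# Toy KG and generated paths (hypothetical).
kg_triples = {
    ("U1", "watched", "M1"),
    ("M1", "starred_by", "A1"),
    ("A1", "starred_in", "M2"),
    ("M1", "directed_by", "D1"),
    ("D1", "directed", "M3"),
}
paths = [
    ["U1", "watched", "M1", "starred_by", "A1", "starred_in", "M2"],   # faithful
    ["U1", "watched", "M1", "directed_by", "D1", "directed", "M3"],    # faithful
    ["U1", "watched", "M1", "starred_by", "A1", "starred_in", "M3"],   # corrupted at hop 3
    ["U1", "starred_in", "M2", "starred_by", "A1", "starred_in", "M2"], # corrupted at hop 1
]
pfr1 = pfr_at_k(paths, kg_triples, 1)
pfr3 = pfr_at_k(paths, kg_triples, 3)
print(pfr1, pfr3)
```

In this toy set the rate falls from 0.75 at hop 1 to 0.5 at hop 3, because one path only becomes corrupted on its last hop.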

KGCD: Knowledge Graph Constraint Decoding, a decoding strategy that sets the logit of a candidate token to -inf (so its probability after softmax is zero) if the token is not connected to the current entity in the KG.

Hallucination: Generation of paths or relations that do not exist in the underlying Knowledge Graph (e.g., linking a user to a movie via 'acted_in').

Direct Embedding Learning: Learning token embeddings from scratch during the language modeling task, rather than using fixed embeddings pre-trained via KGE methods like TransE.

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items in the recommendation list.
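A minimal sketch of NDCG@k with binary relevance (an item is either relevant or not), a common formulation for top-N recommendation. Whether the paper uses exactly this variant is an assumption, and the item IDs are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_items, relevant_items, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    rels = [1.0 if item in relevant_items else 0.0 for item in ranked_items]
    ideal = [1.0] * min(k, len(relevant_items))
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0

ranked = ["M2", "M5", "M1"]      # recommendation list, best first
relevant = {"M1", "M2"}          # held-out items the user interacted with
score = ndcg_at_k(ranked, relevant, k=3)
print(round(score, 2))
```

The score is below 1 here because relevant item `M1` appears at rank 3 rather than rank 2. Moving it up to rank 2 would give a perfect score of 1.0.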

Causal Language Modeling: Training a model to predict the next token in a sequence based only on previous tokens.