Integrating LLM-Derived Multi-Semantic Intent into Graph Model for Session-based Recommendation

📝 Paper Summary

Session-based Recommendation (SBR)
Graph Neural Networks (GNNs)
Large Language Models (LLMs) in Recommendation

LLM-DMsRec integrates LLM-inferred multi-faceted user intents (explicit and latent) with GNN-based structural patterns using a candidate item set to ground the LLM's reasoning.

Core Problem

Existing GNN-based methods for session recommendation rely heavily on ID sequences, ignoring rich semantic information, while direct LLM applications often suffer from hallucinations and fail to align semantic insights with structural graph data.

Why it matters:

Users in session-based systems often have dynamic, multi-faceted intentions (e.g., browsing both phones and headphones) that single-vector ID embeddings fail to capture
Pure LLM approaches lack the collaborative signal found in interaction graphs, leading to plausible but incorrect recommendations (hallucinations)
Bridging the gap between semantic understanding (text) and structural patterns (IDs) is critical for accurate next-item prediction

Concrete Example: A user clicks 'iPhone 16' -> 'Airpods Pro' -> 'Airpods Max' -> 'Watch SE'. A traditional model sees only IDs. An LLM can infer two distinct intents: 'purchase headphones' (the two Airpods items) and 'acquire phone-connected device' (the last three items, all of which pair with the phone). LLM-DMsRec explicitly models both to refine the next prediction.

Key Novelty

Integrating LLM-Derived Multi-Semantic Intent into Graph Model (LLM-DMsRec)

Uses a pre-trained GNN to generate a candidate item set, acting as a 'knowledge base' to constrain and ground the LLM's reasoning, reducing hallucinations
Prompts the LLM to infer multiple semantic intents (explicit vs. latent) from the session text and candidate items, rather than a single summary
Aligns these semantic intent representations with the GNN's structural representations using a KL divergence strategy during training

Architecture

The overall architecture of LLM-DMsRec, illustrating the three stages: Candidate Item Selection, Intent Inference, and Alignment/Training.

Evaluation Highlights

Outperforms state-of-the-art baselines on Beauty dataset: +7.24% improvement in MRR@20 compared to the best baseline (GCE-GNN)
Achieves superior performance on ML-1M dataset: +2.90% improvement in P@20 over the strongest baseline
Successfully integrates with multiple GNN backbones (SR-GNN, GCE-GNN, DHCN), consistently improving their performance

Breakthrough Assessment

7/10

Solid contribution in aligning LLM semantic reasoning with GNN structural signals. The method of using GNN candidates to ground LLM reasoning is a practical solution to hallucinations in recommendation.

⚙️ Technical Details

Problem Definition

Setting: Session-based recommendation to predict the next item in a sequence

Inputs: User interaction session sequence s_t = (v_1, v_2, ..., v_l) containing item IDs and associated textual metadata

Outputs: Ranked list of candidate items likely to be the next interaction v_{l+1}

Pipeline Flow

Candidate Item Selection (Pre-trained GNN)
Intent Inference & Classification (LLM + Rules)
Intent Encoding (BERT)
Intent Alignment & Training (GNN + KL Divergence)

System Modules

Candidate Generator

Identify top-K items using structural patterns to ground LLM reasoning

Model or implementation: Pre-trained GNN (e.g., SR-GNN, GCE-GNN)
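
As a minimal sketch of the candidate-selection step, assume the pre-trained GNN has already produced one relevance score per item for the current session. The function name `top_k_candidates`, the score values, and the item names below are hypothetical illustrations; the paper's value of K is not reported.

```python
def top_k_candidates(scores, k):
    """Return the ids of the k highest-scoring items, best first.

    `scores` maps item id -> score assumed to come from a pre-trained GNN.
    """
    # sorted() is stable, so items with equal scores keep their input order.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:k]]


# Hypothetical scores for one session (not taken from the paper).
scores = {
    "airpods_pro": 2.1,
    "watch_se": 1.7,
    "iphone_case": 0.4,
    "usb_cable": -0.3,
}
candidates = top_k_candidates(scores, k=2)
```

The resulting list is what would be passed to the LLM as the grounding candidate set.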

Intent Reasoner (Intent Inference & Classification)

Infer multi-semantic intents from session text and candidate items

Model or implementation: Qwen2.5-7B-Instruct
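
The grounding idea can be illustrated with a small prompt builder that places the session text and the GNN candidates side by side. This is only a sketch: the template wording, the function name `build_intent_prompt`, and the candidate titles are assumptions made for illustration, not the prompt used in the paper.

```python
def build_intent_prompt(session_titles, candidate_titles):
    """Assemble a prompt asking an LLM for explicit and latent intents.

    The wording is a hypothetical template, not the paper's prompt.
    """
    session = " -> ".join(session_titles)
    candidates = "\n".join(f"- {title}" for title in candidate_titles)
    return (
        "A user interacted with these items in order:\n"
        f"{session}\n\n"
        "Candidate next items (keep the intents consistent with these):\n"
        f"{candidates}\n\n"
        "List the user's explicit intents and latent intents separately."
    )


prompt = build_intent_prompt(
    ["iPhone 16", "Airpods Pro", "Airpods Max", "Watch SE"],
    ["Wireless earbuds", "Smartwatch band"],  # hypothetical candidates
)
```

The returned string would then be sent to an instruction-tuned model such as Qwen2.5-7B-Instruct; the call itself is omitted here because it depends on the serving setup.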

Intent Encoder (Intent Encoding)

Convert text-based intents into vector embeddings

Model or implementation: BERT (pre-trained)

Recommender & Aligner

Fuse structural and semantic intents and predict next item

Model or implementation: GNN (trainable) with KL Divergence loss

Novel Architectural Elements

Two-stage intent extraction where a structural model (GNN) first filters candidates to prompt a semantic model (LLM)
Dual-pipeline alignment mechanism synchronizing explicit/latent semantic embeddings with structural graph embeddings via KL divergence

Modeling

Base Model: Qwen2.5-7B-Instruct (for reasoning), BERT (for encoding), various GNNs (for structure)

Training Method: Joint training of GNN recommender with alignment loss

Objective Functions:

Purpose: Optimize recommendation accuracy.

Formally: Cross-entropy loss L_rec between predicted item probabilities and ground truth.

Purpose: Align semantic and structural representations.

Formally: KL divergence loss L_align between GNN representations and BERT-encoded intent embeddings.
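
The two objectives can be sketched in plain Python as follows. Several details here are assumptions, since the summary does not specify them: embeddings are turned into distributions with a softmax, the KL direction is semantic-to-structural, and the two losses are combined with a hypothetical scalar `weight`.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """L_rec: negative log-probability of the ground-truth item."""
    return -math.log(softmax(logits)[target_index])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def joint_loss(item_logits, target_index, semantic_vec, structural_vec, weight=1.0):
    """L_rec + weight * L_align (the weighting scheme is an assumption)."""
    l_rec = cross_entropy(item_logits, target_index)
    # Assumed: both embeddings are normalized with softmax before comparison.
    l_align = kl_divergence(softmax(semantic_vec), softmax(structural_vec))
    return l_rec + weight * l_align


# Toy values: three items, the second is the ground truth.
loss = joint_loss([0.2, 1.5, -0.3], 1, [0.9, 0.1, 0.0], [0.7, 0.2, 0.1])
```

When the semantic and structural vectors are identical, the alignment term vanishes and the joint loss reduces to the cross-entropy term alone.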

Adaptation: GNN parameters are updated; LLM (Qwen) is used for inference only (frozen/prompt-based); BERT is pre-trained

Training Data:

Datasets: Beauty (Amazon), ML-1M (MovieLens)
Split: Train/Validation/Test (proportions not reported; session datasets are typically split chronologically)

Key Hyperparameters:

top_k_candidates: Not explicitly reported in the paper
GNN_batch_size: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLM4SBR: LLM-DMsRec uses a candidate item set from a GNN to ground the LLM, whereas LLM4SBR reasons directly from the sequence without this structural filter.
vs. LLMGR: LLM-DMsRec does not fine-tune the LLM (saving compute) but aligns embeddings via a separate encoder (BERT), whereas LLMGR fine-tunes the LLM on graph data.
vs. GCE-GNN: Adds a semantic layer derived from LLMs to the purely structural GNN approach.

Limitations

Dependence on the quality of the pre-trained GNN for candidate generation; poor candidates may mislead the LLM
Inference latency due to LLM prompting and BERT encoding is likely higher than pure GNN methods (though not quantified)
No statistical significance tests reported for the improvements

Reproducibility

Code: https://github.com/nsswtt/LLM-DMsRec

Code is publicly available. The paper specifies the exact LLM (Qwen2.5-7B-Instruct) and datasets (Beauty, ML-1M), but hyperparameters such as the candidate set size, batch size, and learning rate are not reported.

📊 Experiments & Results

Evaluation Setup

Next-item prediction on session datasets

Benchmarks:

Beauty (E-commerce session recommendation)
ML-1M (Movie recommendation)

Metrics:

P@20 (Precision@20)
MRR@20 (Mean Reciprocal Rank@20)

Statistical methodology: Not explicitly reported in the paper
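
For reference, the two metrics can be computed as sketched below. This follows the convention common in session-based recommendation code, where each test session has a single ground-truth next item and P@K is the fraction of sessions whose target lands in the top K; whether the paper's evaluation script matches this exactly is an assumption. The function names and toy data are illustrative.

```python
def p_at_k(ranked_lists, targets, k=20):
    """Fraction of sessions whose target item appears in the top-k list."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)


def mrr_at_k(ranked_lists, targets, k=20):
    """Mean of 1/rank of the target, counting 0 when it is outside the top k."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        top = ranked[:k]
        if t in top:
            total += 1.0 / (top.index(t) + 1)  # ranks are 1-based
    return total / len(targets)


# Toy data: three sessions, evaluated at k=2 to keep the lists short.
ranked_lists = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "d"]
p2 = p_at_k(ranked_lists, targets, k=2)      # 2 of 3 sessions hit
mrr2 = mrr_at_k(ranked_lists, targets, k=2)  # (1 + 1/2 + 0) / 3
```

In this toy case the first target is ranked first, the second is ranked second, and the third is missing, so P@2 is 2/3 and MRR@2 is 0.5.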

Key Results

Performance comparison on the Beauty and ML-1M datasets. LLM-DMsRec consistently outperforms the baseline on Beauty, and ML-1M shows similar improvements.

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
Beauty	P@20	0.3662	0.3855	+0.0193
Beauty	MRR@20	0.1988	0.2132	+0.0144
ML-1M	P@20	0.3015	0.3120	+0.0105
ML-1M	MRR@20	0.1705	0.1768	+0.0063

Experiment Figures

Comparison of traditional GNN methods vs. LLM-DMsRec and a motivating example of multi-faceted intent.

Distribution of explicit vs. latent intents on the ML-1M dataset.

Main Takeaways

The proposed method can be seamlessly integrated into various GNN backbones (SR-GNN, GCE-GNN, DHCN) and consistently improves their performance.
Categorizing intents into explicit and latent types allows the model to capture more complex user behaviors than single-intent models.
Using the candidate item set as a knowledge base effectively aligns LLM reasoning with the actual item space, likely reducing the noise from irrelevant LLM outputs.

📚 Prerequisite Knowledge

Prerequisites

Graph Neural Networks (GNNs) for recommendation
Basic understanding of Large Language Models (LLMs) and prompting
Representation alignment concepts (e.g., KL divergence)

Key Terms

SBR: Session-based Recommendation—predicting the next user action based on short-term anonymous interaction history

GNN: Graph Neural Network—neural networks designed to process data represented as graphs, capturing structural dependencies between items

LLM: Large Language Model—massive AI models trained on text that can perform reasoning and generation tasks

Hallucination: A phenomenon where an LLM generates plausible-sounding but factually incorrect or non-existent information

KL divergence: Kullback-Leibler divergence, an asymmetric measure of how one probability distribution differs from another (not a true distance metric), used here to align the semantic (LLM) and structural (GNN) representations

Explicit intent: User intentions that are directly observable or stated, derived from the item properties

Latent intent: Hidden or implied user intentions inferred from the sequence context that may not be immediately obvious

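MRR@20: Mean Reciprocal Rank at 20, the average over test sessions of 1/rank of the correct item, counted as 0 when the item falls outside the top 20

P@20: Precision at 20, which in session-based recommendation is usually computed as the proportion of test sessions whose ground-truth next item appears in the top-20 recommendations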

ID sequence: The sequence of unique identifiers for items, used by traditional recommendation models to track interactions without semantic understanding