Intent Representation Learning with Large Language Model for Recommendation

📝 Paper Summary

LLM-enhanced Recommendation Intent Disentanglement

IRLLRec aligns explicit textual intents from LLM summaries with implicit behavioral intents from interaction graphs using dual-tower contrastive learning and momentum distillation to improve recommendation accuracy.

Core Problem

Existing intent-based recommenders struggle to align the distinct representation spaces of textual data (reviews/descriptions) and interaction data (clicks), so the fused representations often carry modality-specific noise and mismatched intents.

Why it matters:

Implicit interaction data is sparse and noisy, often failing to capture fine-grained user motivations (e.g., buying for 'skin sensitivity' vs. generic popularity).
Rich textual data contains explicit preferences but exists in a different semantic space than collaborative filtering signals, making direct fusion difficult.
Mismatched intents across modalities result in suboptimal recommendations, as behavioral drivers (interactions) and expressed preferences (text) may diverge or contain unique noise.

Concrete Example: A user might interact with a restaurant (implicit intent: busy location) but write a review praising 'diverse menu options' (explicit intent). Current models struggle to match these distinct signals; noise like 'parking difficulties' in text might be irrelevant to the core interaction driver, confusing the recommender.

Key Novelty

Dual-Tower Intent Alignment & Momentum Distillation

Uses a dual-tower architecture to separately encode textual intents (via LLM summaries) and interaction intents (via Graph Neural Networks), then forces them into a shared space using contrastive alignment.
Employs a momentum-based teacher-student framework (Interaction-Text Matching) where evolving teacher encoders guide student encoders to identify and match key latent intents, filtering out noise.
Introduces 'Translation Alignment', which adds noise perturbations to representations so that the alignment is robust to noise inherent in the input features.

Architecture

The overall IRLLRec framework, detailing the flow from multimodal inputs (text and graph) to intent extraction, dual-tower encoding, alignment strategies, and final prediction.

Evaluation Highlights

Outperforms state-of-the-art baselines (including RLMRec and DCCF) on three public datasets (Amazon-Book, Yelp, Steam) across Recall and NDCG metrics.
Achieves significant performance gains in sparse data scenarios, demonstrating the benefit of supplementing interactions with textual intent.
Ablation studies confirm that both the Intent Alignment (IA) and Interaction-Text Matching (ITM) modules are critical, with removal leading to performance drops.

Breakthrough Assessment

7/10

Solid contribution to multimodal recommendation that combines LLM-based text summarization with GNN-based collaborative filtering through alignment and distillation techniques. While the components (contrastive learning, momentum distillation) are known, their specific application to intent disentanglement is well-executed.

⚙️ Technical Details

Problem Definition

Setting: Top-K Recommendation using implicit feedback and textual side information

Inputs: User set U, Item set I, Interaction matrix R, Textual data (reviews/descriptions)

Outputs: Predicted preference score for user u and item i, used to rank the Top-K items

Pipeline Flow

Textual Intent Extraction (LLM-based summarization)
Intent Encoding (Dual-tower: MLP for text, LightGCN for interactions)
Intent Alignment (Pairwise + Translation strategies)
Interaction-Text Matching (Momentum Distillation)
Prediction (Inner product of fused representations)
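The final prediction step can be sketched as follows. This is a minimal illustration only: it assumes the fused user and item vectors have already been produced by the earlier stages, and the names `inner_product` and `top_k_items` are hypothetical helpers, not functions from the IRLLRec code base.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_items(user_vec, item_vecs, seen, k):
    """Rank unseen items by inner-product score and return the k best item ids.

    user_vec  : fused user representation (list of floats)
    item_vecs : dict mapping item id -> fused item representation
    seen      : set of item ids the user already interacted with (excluded)
    """
    scores = {
        item: inner_product(user_vec, vec)
        for item, vec in item_vecs.items()
        if item not in seen
    }
    # Sort by descending score; break ties by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


user = [1.0, 0.0]
items = {0: [0.9, 0.0], 1: [0.1, 1.0], 2: [0.5, 0.5]}
recommended = top_k_items(user, items, seen={0}, k=2)
```

Excluding already-seen items before ranking is a common evaluation convention for Top-K recommendation; whether the paper applies it in exactly this way is an assumption here.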

System Modules

Textual Intent Extractor

Summarize raw text (reviews) into fine-grained user/item intents

Model or implementation: LLM (specific model not named in text, likely GPT or open-source equivalent)

Graph Encoder (Representation Learning)

Capture collaborative signals and model interaction intents

Model or implementation: LightGCN with Intent Prototypes
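As a rough illustration of the graph encoder, the sketch below implements plain LightGCN propagation in pure Python: symmetric-normalized neighborhood aggregation with no feature transformation or non-linearity, followed by averaging the embeddings of all layers. It deliberately omits the intent prototypes, whose exact formulation is not given in this summary. The function name `lightgcn_propagate` and the dictionary-based data layout are assumptions made for readability.

```python
import math


def lightgcn_propagate(edges, user_emb, item_emb, num_layers=2):
    """Plain LightGCN propagation (sketch, no intent prototypes).

    edges    : list of unique (user_id, item_id) interactions
    user_emb : dict user_id -> list[float] (layer-0 embeddings)
    item_emb : dict item_id -> list[float] (layer-0 embeddings)
    Returns the layer-averaged user and item embeddings.
    """
    deg_u, deg_i = {}, {}
    for u, i in edges:
        deg_u[u] = deg_u.get(u, 0) + 1
        deg_i[i] = deg_i.get(i, 0) + 1

    cur_u = {u: list(e) for u, e in user_emb.items()}
    cur_i = {i: list(e) for i, e in item_emb.items()}
    sum_u = {u: list(e) for u, e in cur_u.items()}
    sum_i = {i: list(e) for i, e in cur_i.items()}
    dim = len(next(iter(cur_u.values())))

    for _ in range(num_layers):
        nxt_u = {u: [0.0] * dim for u in cur_u}
        nxt_i = {i: [0.0] * dim for i in cur_i}
        for u, i in edges:
            # Symmetric normalization: 1 / sqrt(deg(u) * deg(i)).
            w = 1.0 / math.sqrt(deg_u[u] * deg_i[i])
            for d in range(dim):
                nxt_u[u][d] += w * cur_i[i][d]
                nxt_i[i][d] += w * cur_u[u][d]
        cur_u, cur_i = nxt_u, nxt_i
        for u in sum_u:
            sum_u[u] = [a + b for a, b in zip(sum_u[u], cur_u[u])]
        for i in sum_i:
            sum_i[i] = [a + b for a, b in zip(sum_i[i], cur_i[i])]

    # Final embedding is the mean over layers 0..num_layers.
    scale = 1.0 / (num_layers + 1)
    final_u = {u: [x * scale for x in v] for u, v in sum_u.items()}
    final_i = {i: [x * scale for x in v] for i, v in sum_i.items()}
    return final_u, final_i


users, items = lightgcn_propagate(
    [(0, 0)], {0: [1.0, 0.0]}, {0: [0.0, 1.0]}, num_layers=1
)
```

With a single user-item edge and one layer, each node receives its neighbor's embedding once, so both averaged embeddings become the midpoint of the two initial vectors.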

Text Encoder (Representation Learning)

Map textual intent embeddings into the shared recommendation space

Model or implementation: MLP (Linear Mapping)

Intent Alignment Module (Alignment)

Align text and interaction spaces using contrastive learning

Model or implementation: InfoNCE Loss + Noise Injection
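The alignment objective can be illustrated with a small, dependency-free sketch of InfoNCE with in-batch negatives, plus a Gaussian perturbation helper in the spirit of the noise injection described above. The temperature of 0.2, the noise scale `sigma`, and the choice of cosine similarity are assumptions for illustration; the summary does not state the values or the exact similarity function the paper uses.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def info_nce(anchors, positives, temperature=0.2):
    """Mean InfoNCE loss: positives[k] is the positive for anchors[k];
    every other entry of `positives` serves as an in-batch negative."""
    total = 0.0
    for k, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[k]
    return total / len(anchors)


def add_gaussian_noise(vectors, sigma, rng):
    """Return perturbed copies of the vectors (hypothetical noise helper)."""
    return [[x + rng.gauss(0.0, sigma) for x in v] for v in vectors]


interaction_intents = [[1.0, 0.0], [0.0, 1.0]]
text_intents = [[1.0, 0.0], [0.0, 1.0]]

pairwise_loss = info_nce(interaction_intents, text_intents)
noisy_text = add_gaussian_noise(text_intents, sigma=0.1, rng=random.Random(0))
translation_loss = info_nce(interaction_intents, noisy_text)
```

When the two towers already agree, the pairwise loss is close to zero; swapping the positives so that each anchor is paired with the wrong text vector makes it large, which is the behavior the alignment term relies on.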

Interaction-Text Matching Module (Alignment)

Refine alignment by matching latent key intents using distillation

Model or implementation: Momentum Distillation (Teacher-Student)
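The teacher update in a momentum (teacher-student) setup is typically an exponential moving average of the student weights. The sketch below shows that update rule on plain Python dictionaries; the default coefficient 0.999 matches the momentum_coefficient listed under Key Hyperparameters, while the parameter layout and the name `momentum_update` are illustrative assumptions.

```python
def momentum_update(teacher, student, m=0.999):
    """Exponential-moving-average update of teacher parameters.

    teacher, student : dict mapping parameter name -> list[float]
    m                : momentum coefficient; values close to 1 make the
                       teacher change slowly and therefore stay stable
    Returns a new dict; the inputs are left unmodified.
    """
    return {
        name: [m * t + (1.0 - m) * s for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


teacher_params = {"w": [1.0, 2.0]}
student_params = {"w": [0.0, 4.0]}
updated = momentum_update(teacher_params, student_params, m=0.9)
```

Only the student receives gradients; the teacher is refreshed with this rule after each optimization step, which is what keeps its matching targets smoother than the student's own predictions.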

Novel Architectural Elements

Dual-tower Intent Alignment combining explicit textual intent (LLM-derived) and implicit interaction intent (GNN-derived).
Integration of Momentum Distillation specifically for cross-modal intent matching in recommendation.

Modeling

Base Model: LightGCN (backbone) + LLM (for preprocessing)

Training Method: Multi-task learning combining BPR loss, Alignment loss, and Matching loss

Objective Functions:

Purpose: Optimize recommendation ranking.
Formally: BPR Loss (Bayesian Personalized Ranking).

Purpose: Align multimodal representations.
Formally: InfoNCE Loss (Pairwise) + Noise-perturbed Contrastive Loss (Translation).

Purpose: Distill knowledge from stable momentum teachers to students.
Formally: KL Divergence between teacher and student similarity distributions.
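To make the three objectives concrete, the sketch below gives textbook forms of the BPR loss and of a KL-divergence distillation loss over softmax-normalized similarity scores, and then combines them in a weighted sum. The KL direction (teacher to student), the softmax normalization, and the `align_weight` parameter are assumptions; only the matching weight of 0.4 comes from the hyperparameters listed in this summary.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(pos - neg)), written as softplus(neg - pos)."""
    pairs = list(zip(pos_scores, neg_scores))
    return sum(softplus(neg - pos) for pos, neg in pairs) / len(pairs)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def distillation_loss(teacher_sims, student_sims):
    """KL(teacher || student) between softmax-normalized similarity scores."""
    p, q = softmax(teacher_sims), softmax(student_sims)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def total_loss(bpr, align, match, align_weight, match_weight=0.4):
    """Weighted multi-task objective; align_weight is a hypothetical knob."""
    return bpr + align_weight * align + match_weight * match


ranking = bpr_loss([2.0], [0.0])
matching = distillation_loss([2.0, 0.0], [0.0, 2.0])
combined = total_loss(1.0, 2.0, 3.0, align_weight=0.5)
```

The distillation term is zero when the student reproduces the teacher's similarity distribution and grows as the two distributions diverge.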

Key Hyperparameters:

momentum_coefficient: 0.999
matching_loss_weight: 0.4
regularization_L2: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. KGIN: KGIN derives intents from knowledge graphs (KGs); IRLLRec derives them from LLM-summarized text and interaction graphs.
vs. DCCF: DCCF focuses only on interaction disentanglement; IRLLRec aligns multimodal (text + interaction) intents.
vs. RLMRec: RLMRec aligns semantic and interaction spaces globally; IRLLRec focuses on fine-grained *intent* alignment and noise mitigation via momentum distillation.
vs. AlphaRec: AlphaRec replaces IDs with language embeddings; IRLLRec keeps ID embeddings but enhances them with aligned textual intents.

Limitations

Relies on the quality of LLM-generated summaries; poor prompts or hallucinations could degrade textual intent quality.
Complexity analysis suggests computational cost scales with the number of latent intents and encoder layers.
Requires both rich textual data and interaction data; may be less effective in datasets with missing reviews or descriptions.
Momentum distillation adds memory overhead by maintaining shadow copies of encoders.

Reproducibility

Code: https://github.com/wangyu0627/IRLLRec

The code is publicly available at the repository above. The paper describes its Chain-of-Thought prompts and mathematical formulations in detail. However, several hyperparameters (learning rate, batch size, embedding dimension) and the computational resources used (e.g., GPU type) are not explicitly reported.

📊 Experiments & Results

Evaluation Setup

Top-K Recommendation on sparse datasets

Benchmarks:

Amazon-Book (E-commerce Recommendation)
Yelp (Business/Restaurant Recommendation)
Steam (Game Recommendation)

Metrics:

Recall@K
NDCG@K
Statistical methodology: Not explicitly reported in the paper
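For reference, the two metrics above can be computed per user as in the following sketch, which uses the standard binary-relevance definitions. The helper names are illustrative, and the cutoff values K used in the paper are not stated in this summary.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of a user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


ranked_items = [3, 1, 2]   # model output, best first
held_out = {1, 5}          # ground-truth test items for this user
recall = recall_at_k(ranked_items, held_out, k=2)
ndcg = ndcg_at_k(ranked_items, held_out, k=2)
```

Reported dataset-level numbers are normally the mean of these per-user values over all test users.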

Experiment Figures

Concept illustration of multimodal intents. (a-b) show user-item interaction intents. (c) shows the distribution gap between profile and intent representations. (d) illustrates the matching of specific textual feedback (likes/dislikes) to interaction intents.

Main Takeaways

IRLLRec consistently outperforms baselines across all three datasets, validating the effectiveness of integrating LLM-derived textual intents with collaborative filtering.
Ablation studies show that removing the Interaction-Text Matching (ITM) module causes a performance drop, confirming the importance of filtering noise and matching key intents.
The Translation Alignment strategy (adding noise) contributes to robustness, as evidenced by performance gains over simple pairwise alignment.
The model shows improved performance in sparse interaction scenarios compared to models that rely solely on interactions.

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering (CF) and Graph Neural Networks (LightGCN)
Contrastive Learning (InfoNCE loss)
Knowledge Distillation (Teacher-Student models)
Large Language Models (for text summarization)

Key Terms

Intent Disentanglement: Separating user/item representations into distinct vectors representing different underlying motivations (e.g., price, quality, brand).

CoT: Chain of Thought, a prompting strategy that encourages LLMs to reason step-by-step to generate better summaries.

Momentum Distillation: A training technique where a 'teacher' model is updated as a moving average of the 'student' model's weights to provide stable pseudo-labels.

LightGCN: A simplified Graph Convolutional Network for recommendation that removes non-linearities and feature transformation to focus on neighborhood aggregation.

InfoNCE: A contrastive loss function that maximizes similarity between positive pairs while minimizing similarity with negative pairs.

GSL: Graph Structure Learning, a family of techniques that refine the graph topology (e.g., removing noisy edges) during training.

Translation Alignment: An alignment strategy involving adding Gaussian noise to embeddings to simulate sample shifting and improve robustness.