IRLLRec aligns explicit textual intents from LLM summaries with implicit behavioral intents from interaction graphs using dual-tower contrastive learning and momentum distillation to improve recommendation accuracy.
Core Problem
Existing intent-based recommenders struggle to align the distinct representation spaces of textual data (reviews/descriptions) and interaction data (clicks), often leading to noise and misalignment.
Why it matters:
Implicit interaction data is sparse and noisy, often failing to capture fine-grained user motivations (e.g., buying for 'skin sensitivity' vs. generic popularity).
Rich textual data contains explicit preferences but exists in a different semantic space than collaborative filtering signals, making direct fusion difficult.
Mismatched intents across modalities result in suboptimal recommendations, as behavioral drivers (interactions) and expressed preferences (text) may diverge or contain unique noise.
Concrete Example: A user might visit a restaurant because of its busy location (implicit intent) but write a review praising its 'diverse menu options' (explicit intent). Current models struggle to match these distinct signals, and noise in the text such as 'parking difficulties' may be irrelevant to what actually drove the interaction, confusing the recommender.
Uses a dual-tower architecture to separately encode textual intents (via LLM summaries) and interaction intents (via Graph Neural Networks), then forces them into a shared space using contrastive alignment.
Employs a momentum-based teacher-student framework (Interaction-Text Matching) where evolving teacher encoders guide student encoders to identify and match key latent intents, filtering out noise.
Introduces 'Translation Alignment' which adds noise perturbations to representations to make the alignment robust against inherent input feature noise.
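The momentum teacher described above can be illustrated with a minimal sketch of an exponential-moving-average (EMA) weight update. This is a generic illustration, not the paper's implementation: the dictionary-of-arrays parameter layout, the function name `momentum_update`, and the momentum values are all assumptions chosen for the example.

```python
import numpy as np

def momentum_update(teacher, student, momentum=0.995):
    """EMA update: teacher <- m * teacher + (1 - m) * student.

    `teacher` and `student` are dicts mapping parameter names to arrays
    (a hypothetical layout for illustration). The teacher receives no
    gradients; it only tracks a smoothed copy of the student.
    """
    for name, s_param in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s_param
    return teacher

# Toy example: the teacher starts as a copy of the student.
student = {"W": np.ones((2, 2)), "b": np.zeros(2)}
teacher = {k: v.copy() for k, v in student.items()}

# Pretend one optimizer step moved the student weights from 1.0 to 2.0.
student["W"] = student["W"] + 1.0

# With momentum 0.9 the teacher moves only 10% of the way toward the student.
momentum_update(teacher, student, momentum=0.9)
print(teacher["W"])
```

Because the teacher changes slowly, its outputs are more stable than the student's from step to step, which is why such teachers are typically used to produce soft matching targets.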
Architecture
The overall IRLLRec framework, detailing the flow from multimodal inputs (text and graph) to intent extraction, dual-tower encoding, alignment strategies, and final prediction.
Evaluation Highlights
Outperforms state-of-the-art baselines (including RLMRec and DCCF) on three public datasets (Amazon-Book, Yelp, Steam) across Recall and NDCG metrics.
Achieves significant performance gains in sparse data scenarios, demonstrating the benefit of supplementing interactions with textual intent.
Ablation studies confirm that both the Intent Alignment (IA) and Interaction-Text Matching (ITM) modules are critical, with removal leading to performance drops.
Breakthrough Assessment
7/10
Solid contribution to multimodal recommendation by effectively combining LLM-based text summarization with GNN-based collaborative filtering via novel alignment and distillation techniques. While the components (contrastive learning, distillation) are known, their specific application to intent disentanglement is well-executed.
⚙️ Technical Details
Problem Definition
Setting: Top-K Recommendation using implicit feedback and textual side information
Inputs: User set U, Item set I, Interaction matrix R, Textual data (reviews/descriptions)
Outputs: Predicted probability of interaction between user u and item i
Learning rate: Not explicitly reported in the paper
Compute: Not reported in the paper
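To make the Top-K setting concrete, the sketch below ranks items for each user from embeddings and the interaction matrix R. It assumes scores come from an inner product between user and item embeddings, which is a common choice but is not confirmed by this summary. The function name `recommend_top_k` and the toy data are hypothetical.

```python
import numpy as np

def recommend_top_k(user_emb, item_emb, interactions, k):
    """Return the indices of the k highest-scoring unseen items per user.

    Assumption: the preference score is an inner product. Passing the
    scores through a sigmoid would yield values in (0, 1) without
    changing the ranking.
    """
    scores = user_emb @ item_emb.T
    # Mask items the user has already interacted with.
    scores = np.where(interactions > 0, -np.inf, scores)
    return np.argsort(-scores, axis=1)[:, :k]

# Toy data: 2 users, 4 items, 2-dimensional embeddings.
user_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
item_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
R = np.array([[1, 0, 0, 0],   # user 0 has already interacted with item 0
              [0, 0, 1, 0]])  # user 1 has already interacted with item 2

recs = recommend_top_k(user_emb, item_emb, R, k=2)
print(recs)
```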
Comparison to Prior Work
vs. KGIN: KGIN uses KGs for intent; IRLLRec uses LLM-summarized text and interaction graphs.
vs. DCCF: DCCF focuses only on interaction disentanglement; IRLLRec aligns multimodal (text + interaction) intents.
vs. RLMRec: RLMRec aligns semantic and interaction spaces globally; IRLLRec focuses on fine-grained *intent* alignment and noise mitigation via momentum distillation.
Code is publicly available at https://github.com/wangyu0627/IRLLRec. The paper describes the Chain-of-Thought prompts and mathematical formulations in detail. However, specific hyperparameters (learning rate, batch size, embedding dimension) and computational resources (GPU type) are not explicitly detailed in the text provided.
📊 Experiments & Results
Evaluation Setup
Top-K Recommendation on sparse datasets
Benchmarks:
Amazon-Book (E-commerce Recommendation)
Yelp (Business/Restaurant Recommendation)
Steam (Game Recommendation)
Metrics:
Recall@K
NDCG@K
Statistical methodology: Not explicitly reported in the paper
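For reference, the two metrics can be computed per user as in the sketch below. It uses the common binary-relevance definitions. The paper's exact evaluation protocol (values of K, candidate sampling, averaging) is not given in this summary, so treat the details as assumptions.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the user's held-out items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: hits are discounted by log2 of their rank."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: a ranked list of item ids and the user's held-out items.
ranked = [10, 42, 7, 3, 99]
relevant = {42, 3, 5}

recall = recall_at_k(ranked, relevant, k=5)
ndcg = ndcg_at_k(ranked, relevant, k=5)
print(recall, ndcg)
```

Dataset-level scores are usually the mean of these per-user values over all test users.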
Experiment Figures
Concept illustration of multimodal intents. (a-b) show user-item interaction intents. (c) shows the distribution gap between profile and intent representations. (d) illustrates the matching of specific textual feedback (likes/dislikes) to interaction intents.
Main Takeaways
IRLLRec consistently outperforms baselines across all three datasets, validating the effectiveness of integrating LLM-derived textual intents with collaborative filtering.
Ablation studies show that removing the Interaction-Text Matching (ITM) module causes a performance drop, confirming the importance of filtering noise and matching key intents.
The Translation Alignment strategy (adding noise) contributes to robustness, as evidenced by performance gains over simple pairwise alignment.
The model shows improved performance in sparse interaction scenarios compared to models that rely solely on interactions.
📚 Prerequisite Knowledge
Prerequisites
Collaborative Filtering (CF) and Graph Neural Networks (LightGCN)
Contrastive Learning (InfoNCE loss)
Knowledge Distillation (Teacher-Student models)
Large Language Models (for text summarization)
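As a refresher on the LightGCN prerequisite, the sketch below shows its propagation rule: embeddings are repeatedly averaged over the symmetrically normalized user-item graph, with no weight matrices or non-linearities, and the layer outputs are then averaged. This follows the standard LightGCN formulation. How IRLLRec configures its graph encoder (number of layers, layer weighting) is not stated in this summary, so `num_layers=2` and the uniform layer average are assumptions.

```python
import numpy as np

def lightgcn_embeddings(R, user_emb, item_emb, num_layers=2):
    """Propagate embeddings over the bipartite graph defined by R."""
    n_users, n_items = R.shape
    n = n_users + n_items

    # Symmetric adjacency matrix of the user-item bipartite graph.
    A = np.zeros((n, n))
    A[:n_users, n_users:] = R
    A[n_users:, :n_users] = R.T

    # Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes stay zero.
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    # Linear propagation, then a uniform average over all layers.
    E = np.vstack([user_emb, item_emb])
    layers = [E]
    for _ in range(num_layers):
        E = A_hat @ E
        layers.append(E)
    final = np.mean(layers, axis=0)
    return final[:n_users], final[n_users:]

# Toy data: 2 users, 3 items, 4-dimensional embeddings.
rng = np.random.default_rng(0)
R = np.array([[1, 0, 1],
              [0, 1, 0]])
U = rng.standard_normal((2, 4))
I = rng.standard_normal((3, 4))

users_out, items_out = lightgcn_embeddings(R, U, I, num_layers=2)
print(users_out.shape, items_out.shape)
```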
Key Terms
Intent Disentanglement: Separating user/item representations into distinct vectors representing different underlying motivations (e.g., price, quality, brand).
CoT: Chain of Thought, a prompting strategy that encourages LLMs to reason step-by-step to generate better summaries.
Momentum Distillation: A training technique where a 'teacher' model is updated as a moving average of the 'student' model's weights to provide stable pseudo-labels.
LightGCN: A simplified Graph Convolutional Network for recommendation that removes non-linearities and feature transformation to focus on neighborhood aggregation.
InfoNCE: A contrastive loss function that maximizes similarity between positive pairs while minimizing similarity with negative pairs.
GSL: Graph Structure Learning, a family of techniques that refine the graph topology (e.g., removing noisy edges) during training.
Translation Alignment: An alignment strategy involving adding Gaussian noise to embeddings to simulate sample shifting and improve robustness.
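The InfoNCE and Translation Alignment entries can be tied together with a small sketch: Gaussian noise is added to both the text-side and interaction-side embeddings before a contrastive loss pulls matching rows together. This is only an illustration of the general idea. The paper's exact loss, the noise scale `sigma`, and the temperature are not given in this summary, and the helper names (`info_nce`, `perturb`) are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(text_emb, inter_emb, temperature=0.2):
    """InfoNCE loss where row i of each matrix forms the positive pair
    and all other rows in the batch act as negatives."""
    t = l2_normalize(text_emb)
    g = l2_normalize(inter_emb)
    logits = t @ g.T / temperature
    # Subtract the row-wise max for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def perturb(emb, rng, sigma=0.1):
    """Add Gaussian noise to embeddings (sigma is an assumed value)."""
    return emb + sigma * rng.standard_normal(emb.shape)

rng = np.random.default_rng(0)
inter = rng.standard_normal((8, 16))                         # interaction-side intents
text_aligned = inter + 0.05 * rng.standard_normal((8, 16))   # nearly matching text side
text_random = rng.standard_normal((8, 16))                   # unrelated text side

# Noise is applied to both views before alignment.
loss_aligned = info_nce(perturb(text_aligned, rng), perturb(inter, rng))
loss_random = info_nce(text_random, inter)
print(loss_aligned, loss_random)
```

In this toy setup the loss stays low for pairs that are already close even after perturbation, while unrelated pairs incur a much higher loss, which is the behavior the alignment objective relies on.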