X Kong, J Wu, A Zhang, L Sheng, H Lin, X Wang, X He
National University of Singapore,
Electronic Science Research Institute of China Electronics Technology Group Corporation
arXiv, 8/2024
Recommendation, P13N
📝 Paper Summary
Sequential Recommendation, Large Language Models (LLMs) for Recommendation, Parameter-Efficient Fine-Tuning (PEFT)
iLoRA treats sequential recommendation as multi-task learning by dynamically assembling a personalized Low-Rank Adaptation (LoRA) module for each user sequence using a mixture of experts to capture individual behavioral variability.
Core Problem
Standard LoRA fine-tuning applies a uniform set of parameters across all user sequences, ignoring the significant variability in individual behaviors and causing negative transfer between dissimilar sequences.
Why it matters:
User behaviors exhibit distinct interests and patterns; forcing a single model adaptation to handle all variations leads to suboptimal performance.
Unrelated tasks (or dissimilar user sequences) exhibit different gradient trajectories, leading to conflicts and negative transfer when using shared parameters.
Existing methods focus on prompt engineering but leave the fine-tuning mechanism static, limiting the model's ability to adapt to diverse user needs.
Concrete Example: In LLaRA, gradients from distant user clusters in the collaborative space are misaligned (Figure 1). A uniform LoRA module tries to satisfy conflicting updates from these dissimilar sequences, resulting in a 'Jack of all trades, master of none' effect where the model fails to specialize for either user type.
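The misalignment described above can be quantified with the cosine similarity between gradients computed on different user clusters. The following is a minimal sketch with made-up gradient vectors, not the paper's measurement code; `grad_cosine` and the toy values are hypothetical.

```python
import math

def grad_cosine(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    return dot / (norm1 * norm2)

# Toy gradients of one shared adapter, computed on two user clusters (illustrative values).
g_cluster_a = [1.0, 0.5, -0.2]
g_cluster_b = [-0.8, -0.6, 0.1]

sim = grad_cosine(g_cluster_a, g_cluster_b)
# A negative similarity means a step that helps one cluster tends to hurt the other.
conflict = sim < 0
```

Applied pairwise over many clusters, values like `sim` would form the kind of similarity heatmap the summary refers to.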
Key Novelty
Instance-wise LoRA (iLoRA)
Replaces the standard single LoRA matrices with a bank of 'expert' sub-matrices, where each expert specializes in different latent aspects of user behavior.
Uses a gating network, guided by a dense representation of the user's history (from a standard recommender like SASRec), to calculate dynamic attention scores for each instance.
Aggregates these experts on the fly into a unique, instance-specific LoRA module for every input sequence, adding less than 1% to the trainable parameter count of standard LoRA.
Architecture
The iLoRA framework. It shows how a user sequence is processed by SASRec to get a representation z, which is then used by a Gating Network to output weights ω. These weights combine multiple LoRA experts (A_k, B_k) into specific A and B matrices for the LLM.
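One way to write the aggregation in this figure is given below. It assumes a softmax gate with a hypothetical projection matrix $W_g$, a frozen base weight written $W_0$, and a weighted sum over expert products; the exact form used in the paper may differ.

```latex
\omega = \operatorname{softmax}(W_g z), \qquad
\Delta W(i_{<n}) = \sum_{k=1}^{K} \omega_k \, B_k A_k, \qquad
W' = W_0 + \Delta W(i_{<n})
```

Under this assumption, stacking the experts along the rank dimension, with $A = [A_1; \dots; A_K]$ and $B = [\omega_1 B_1, \dots, \omega_K B_K]$, gives $BA = \Delta W(i_{<n})$, which matches the description of combining experts into a single pair of matrices.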
Evaluation Highlights
Achieves an average relative improvement of 11.4% in Hit Ratio over basic LoRA across three datasets.
Outperforms state-of-the-art LLM-based method LLaRA and traditional methods like SASRec on LastFM, MovieLens, and Steam datasets.
Accomplishes these gains with less than a 1% relative increase in trainable parameters compared to standard LoRA.
Breakthrough Assessment
7/10
Offers a smart, parameter-efficient application of MoE to LoRA for recommendation. While the architectural components (LoRA, MoE) are known, their combination to solve the specific 'negative transfer in sequential recommendation' problem is novel and effective.
⚙️ Technical Details
Problem Definition
Setting: Sequential recommendation as an autoregressive generation task.
Inputs: A sequence of historical items i_<n = [i_1, ..., i_{n-1}] converted into a hybrid prompt x combining textual and behavioral tokens.
Outputs: The textual description y of the next item i_n of interest.
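As a rough illustration of the hybrid prompt x, the sketch below splices projected item embeddings between text-token embeddings. The sizes, the single linear projector `W_proj`, and the splice position are all assumptions made for the example, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_llm, d_rec = 16, 8          # toy sizes (assumed), not the real model dimensions

# Hypothetical stand-ins: text-token embeddings from the LLM and
# item embeddings from the sequential recommender.
text_emb = rng.normal(size=(5, d_llm))       # 5 prompt tokens
item_emb = rng.normal(size=(3, d_rec))       # 3 historical items i_1..i_3

# Behavioral projector: maps recommender space into the LLM embedding space
# (a single linear layer in this sketch).
W_proj = rng.normal(size=(d_rec, d_llm)) * 0.1
behavior_tokens = item_emb @ W_proj          # shape (3, d_llm)

# Hybrid prompt x: behavioral tokens inserted after the second text token
# (the position is arbitrary here).
x = np.concatenate([text_emb[:2], behavior_tokens, text_emb[2:]], axis=0)
```

The resulting `x` has one row per prompt position and can be fed to the LLM as input embeddings in place of token IDs.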
Pipeline Flow
Sequence Encoder (SASRec): Generates dense sequence representation z
Gating Network: Computes instance-wise expert weights ω from z
Expert Aggregation: Assembles instance-specific LoRA matrices A and B
LLM Inference: Llama-2 processes hybrid prompt using the assembled LoRA
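The four steps above can be sketched for a single adapted layer as follows. This is a toy NumPy illustration under stated assumptions: random arrays stand in for the SASRec output and the trained weights, `W_g`, `K`, and `r_k` are hypothetical names and values, and the weighted stacking is one plausible aggregation rather than the confirmed one.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_out, d_z = 12, 10, 6   # toy layer and encoder sizes (assumed)
K, r_k = 4, 2                  # K experts of rank r_k each (assumed values)

# 1. Sequence representation z (random stand-in for the SASRec output).
z = rng.normal(size=d_z)

# 2. Gating network: linear projection followed by softmax.
W_g = rng.normal(size=(K, d_z))
omega = softmax(W_g @ z)

# Expert banks: A_k has shape (r_k, d_in), B_k has shape (d_out, r_k).
A_experts = rng.normal(size=(K, r_k, d_in))
B_experts = rng.normal(size=(K, d_out, r_k))

# 3. Assemble instance-specific A and B by stacking experts along the rank axis,
#    scaling each B_k by its gate weight.
A = A_experts.reshape(K * r_k, d_in)
B = np.concatenate([omega[k] * B_experts[k] for k in range(K)], axis=1)

# 4. Apply the adapted layer to one hidden vector h; W0 is the frozen base weight.
W0 = rng.normal(size=(d_out, d_in))
h = rng.normal(size=d_in)
out = W0 @ h + B @ (A @ h)

# The stacked form equals a weighted sum of per-expert low-rank updates.
delta_sum = sum(omega[k] * B_experts[k] @ A_experts[k] for k in range(K))
```

Because `B @ A` matches `delta_sum`, the assembled module can be used exactly like an ordinary LoRA pair once the gate weights are known for a given sequence.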
System Modules
Sequence Encoder
Extract a holistic representation of user behavior patterns to guide the expert selection
Model or implementation: SASRec (pre-trained)
Gating Network
Calculate attention scores for experts based on the sequence representation
Model or implementation: Linear projection + Softmax
iLoRA Module
Dynamic parameter adaptation for the LLM attention layers
Model or implementation: Mixture of Low-Rank Matrices
Base LLM
Generate the next item prediction
Model or implementation: Llama-2-7B
Novel Architectural Elements
Split-LoRA Architecture: Dividing standard LoRA matrices A and B into K sub-matrices (experts) to capture different latent behaviors.
Instance-Guided Gating: Using an external recommender's embedding (SASRec) to drive the gating function for an LLM adapter, rather than using the LLM's own internal states.
Modeling
Base Model: Llama-2-7B
Training Method: Supervised Fine-Tuning (Instruction Tuning) with iLoRA
Objective Functions:
Purpose: Maximize the likelihood of the correct next item token sequence.
Formally: Autoregressive language modeling loss L = - Σ log P(y_t | y_<t, x; φ + Δφ(i_<n))
Adaptation: Instance-wise LoRA (iLoRA)
Trainable Parameters: Only iLoRA parameters (experts + gating) and behavioral projector are trained; Base LLM is frozen.
Key Hyperparameters:
LoRA_rank_r: Not explicitly reported in the paper
Number_of_experts_K: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
Compute: Trainable parameter count stays close to that of standard LoRA (relative increase below 1%).
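The parameter-overhead claim can be checked with simple arithmetic. The sketch below assumes that a rank-r LoRA pair is split into K experts of rank r/K each and that the only extra parameters are a bias-free linear gate from the sequence representation to K scores; the values of r, K, and the encoder width are illustrative, since the summary does not report them.

```python
def lora_params(d_in, d_out, r):
    """Parameters in one standard LoRA pair: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

def split_lora_params(d_in, d_out, r, K, d_z):
    """K experts of rank r // K each, plus a linear gate mapping d_z -> K (bias omitted)."""
    assert r % K == 0, "this sketch assumes K divides r"
    experts = K * lora_params(d_in, d_out, r // K)
    gate = d_z * K
    return experts + gate

# Illustrative sizes only (assumed, not taken from the paper).
d_in = d_out = 4096
r, K, d_z = 8, 4, 64

base = lora_params(d_in, d_out, r)                  # 65536
ours = split_lora_params(d_in, d_out, r, K, d_z)    # 65792
relative_increase = (ours - base) / base            # 0.00390625
```

With these numbers the experts cost exactly as much as the original pair, and the gate accounts for the whole increase of about 0.4%, which is in line with the reported bound of under 1%.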
Comparison to Prior Work
vs. LLaRA: iLoRA uses dynamic, instance-specific LoRA weights via MoE instead of a single static LoRA module.
vs. TALLRec: iLoRA incorporates behavioral tokens and dynamic adaptation, whereas TALLRec uses static LoRA on text only.
vs. MoRec: MoRec replaces item IDs with text but uses standard BERT-like encoders; iLoRA uses generative LLMs with MoE adapters [not cited in paper].
Limitations
Relies on a pre-trained sequential recommender (SASRec) for gating signals, introducing a dependency.
Inference complexity might be slightly higher than standard LoRA due to the gating computation and weight aggregation per instance (though parameters are similar).
Code and data are publicly available at https://github.com/AkaliKong/iLoRA. The paper states that it keeps the same experimental settings as LLaRA [9].
📊 Experiments & Results
Evaluation Setup
Next-item prediction on sequential recommendation datasets.
Benchmarks:
LastFM (Music Artist Recommendation)
MovieLens (Movie Recommendation)
Steam (Game Recommendation)
Metrics:
Hit Ratio (HR)
NDCG
Statistical methodology: Not explicitly reported in the paper
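For reference, the two metrics can be computed per test sequence as shown below, using the common definitions for a single ground-truth next item. The cutoff k and the helper names are assumptions; the summary does not state which cutoffs the paper uses.

```python
import math

def hit_ratio_at_k(ranked_items, target, k):
    """1.0 if the target appears in the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with a single relevant item: 1 / log2(rank + 1), or 0 if not in the top k."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1      # 1-based position
    return 1.0 / math.log2(rank + 1)

# Toy example: the ground-truth next item "c" is ranked third.
ranking = ["a", "b", "c", "d"]
hr = hit_ratio_at_k(ranking, "c", k=3)    # 1.0
ndcg = ndcg_at_k(ranking, "c", k=3)       # 0.5
```

Dataset-level scores are the averages of these per-sequence values; NDCG rewards placing the correct item nearer the top, while Hit Ratio only checks whether it appears within the cutoff.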
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
Average across 3 datasets | Hit Ratio (HR) | Not explicitly reported in the paper | Not explicitly reported in the paper | -
Experiment Figures
Gradient similarity heatmap for LLaRA (standard LoRA) across different user sequences.
Main Takeaways
iLoRA consistently outperforms standard LoRA (LLaRA) and traditional baselines (SASRec, etc.) across all datasets.
The method effectively mitigates negative transfer by disentangling diverse user behaviors into expert sub-spaces.
The improvement is achieved with negligible parameter overhead, validating the efficiency of the MoE-LoRA design.
📚 Prerequisite Knowledge
Prerequisites
Low-Rank Adaptation (LoRA) for LLMs
Sequential Recommendation (SASRec, GRU4Rec)
Mixture of Experts (MoE)
Instruction Tuning
Key Terms
LoRA: Low-Rank Adaptation—a PEFT method that injects trainable low-rank matrices into transformer layers to approximate weight updates while freezing the base model.
PEFT: Parameter-Efficient Fine-Tuning—techniques to adapt large pre-trained models with minimal parameter updates.
MoE: Mixture of Experts—an architecture where different parts of the model (experts) are activated for different inputs.
Negative Transfer: A phenomenon in multi-task learning where training on one task degrades performance on another due to conflicting gradient updates.
SASRec: Self-Attentive Sequential Recommendation—a transformer-based model for sequential recommendation used here to generate guidance representations.
Hybrid Prompting: Combining text tokens (from LLM tokenizer) with behavioral tokens (learned item embeddings from a recommender) in the input prompt.
Gating Network: A mechanism that computes attention weights (probabilities) to determine how much each expert contributes to the final output.