Customizing Language Models with Instance-wise LoRA for Sequential Recommendation

📝 Paper Summary

Sequential Recommendation
Large Language Models (LLMs) for Recommendation
Parameter-Efficient Fine-Tuning (PEFT)

iLoRA treats sequential recommendation as multi-task learning by dynamically assembling a personalized Low-Rank Adaptation (LoRA) module for each user sequence using a mixture of experts to capture individual behavioral variability.

Core Problem

Standard LoRA fine-tuning applies a uniform set of parameters across all user sequences, ignoring the significant variability in individual behaviors and causing negative transfer between dissimilar sequences.

Why it matters:

User behaviors exhibit distinct interests and patterns; forcing a single model adaptation to handle all variations leads to suboptimal performance.
Unrelated tasks (or dissimilar user sequences) exhibit different gradient trajectories, leading to conflicts and negative transfer when using shared parameters.
Existing methods focus on prompt engineering but leave the fine-tuning mechanism static, limiting the model's ability to adapt to diverse user needs.

Concrete Example: In LLaRA, gradients from distant user clusters in the collaborative space are misaligned (Figure 1). A uniform LoRA module tries to satisfy conflicting updates from these dissimilar sequences, resulting in a 'Jack of all trades, master of none' effect where the model fails to specialize for either user type.

Key Novelty

Instance-wise LoRA (iLoRA)

Replaces the standard single LoRA matrices with a bank of 'expert' sub-matrices, where each expert specializes in different latent aspects of user behavior.
Uses a gating network, guided by a dense representation of the user's history (from a standard recommender like SASRec), to calculate dynamic attention scores for each instance.
Aggregates these experts on the fly to create a unique, instance-specific LoRA module for every input sequence, while keeping the trainable parameter count within about 1% of standard LoRA.

Architecture

The iLoRA framework. It shows how a user sequence is processed by SASRec to get a representation z, which is then used by a Gating Network to output weights ω. These weights combine multiple LoRA experts (A_k, B_k) into specific A and B matrices for the LLM.
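The gating and aggregation steps in this description can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the gate is a bias-free linear projection followed by softmax, that each expert pair has shapes A_k (r_k × d_in) and B_k (d_out × r_k), and that combining experts amounts to the weighted sum Σ_k ω_k · B_k A_k. The function names (`gate`, `ilora_update`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gate(z, gate_weights):
    """Linear projection + softmax: sequence representation z -> expert weights omega."""
    return softmax(matvec(gate_weights, z))


def ilora_update(x, omega, experts):
    """Apply sum_k omega_k * B_k (A_k x) to an input vector x.

    experts is a list of (A_k, B_k) pairs with A_k of shape (r_k, d_in)
    and B_k of shape (d_out, r_k); these shapes are an assumption.
    """
    d_out = len(experts[0][1])
    delta = [0.0] * d_out
    for weight, (a_k, b_k) in zip(omega, experts):
        hidden = matvec(a_k, x)        # down-projection by expert k
        update = matvec(b_k, hidden)   # up-projection by expert k
        delta = [d + weight * u for d, u in zip(delta, update)]
    return delta


# Toy example: K = 2 experts, d_z = 2, d_in = d_out = 2, r_k = 1.
z = [1.0, 2.0]
gate_weights = [[0.0, 0.0], [0.0, 0.0]]  # zero projection -> uniform weights
omega = gate(z, gate_weights)
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [1.0]]),
]
delta = ilora_update([2.0, 4.0], omega, experts)
```

Stacking the scaled A_k row-wise and concatenating the B_k column-wise yields a single pair (A, B) whose product equals the same weighted sum, which is one way to read "specific A and B matrices" above.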

Evaluation Highlights

Achieves an average relative improvement of 11.4% in Hit Ratio over basic LoRA across three datasets.
Outperforms state-of-the-art LLM-based method LLaRA and traditional methods like SASRec on LastFM, MovieLens, and Steam datasets.
Accomplishes these gains with less than a 1% relative increase in trainable parameters compared to standard LoRA.

Breakthrough Assessment

7/10

Offers a smart, parameter-efficient application of MoE to LoRA for recommendation. While the architectural components (LoRA, MoE) are known, their combination to solve the specific 'negative transfer in sequential recommendation' problem is novel and effective.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation as an autoregressive generation task.

Inputs: A sequence of historical items i_<n = [i_1, ..., i_{n-1}] converted into a hybrid prompt x combining textual and behavioral tokens.

Outputs: The textual description y of the next item i_n of interest.
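A rough sketch of how such a hybrid prompt could be assembled is shown below. It assumes text tokens are looked up in an embedding table and item embeddings from the recommender are mapped into the LLM space by a projector. The projector is modeled here as a single linear map for brevity, and `build_hybrid_prompt`, `text_embeddings`, and `projector` are hypothetical names rather than anything taken from the paper or its code.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def build_hybrid_prompt(segments, text_embeddings, projector):
    """Return a list of LLM-space vectors for a mixed text/item prompt.

    segments: list of ("text", token_id) or ("item", item_embedding) pairs.
    text_embeddings: dict token_id -> vector of size d_llm (hypothetical lookup).
    projector: (d_llm, d_rec) matrix; a linear map is assumed for brevity.
    """
    prompt = []
    for kind, value in segments:
        if kind == "text":
            prompt.append(list(text_embeddings[value]))
        elif kind == "item":
            # Behavioral token: project the recommender embedding into LLM space.
            prompt.append(matvec(projector, value))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return prompt


# Toy example with d_llm = 3 and d_rec = 2.
text_embeddings = {0: [1.0, 0.0, 0.0]}
projector = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
segments = [("text", 0), ("item", [2.0, 3.0])]
prompt = build_hybrid_prompt(segments, text_embeddings, projector)
```

The resulting sequence of vectors stands in for the prompt x: every position lives in the same embedding space regardless of whether it came from text or from user behavior.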

Pipeline Flow

Sequence Encoder (SASRec): Generates dense sequence representation z
Gating Network: Computes instance-wise expert weights ω from z
Expert Aggregation: Assembles instance-specific LoRA matrices A and B
LLM Inference: Llama-2 processes hybrid prompt using the assembled LoRA

System Modules

Sequence Encoder

Extract a holistic representation of user behavior patterns to guide the expert selection

Model or implementation: SASRec (pre-trained)

Gating Network

Calculate attention scores for experts based on the sequence representation

Model or implementation: Linear projection + Softmax

iLoRA Module

Dynamic parameter adaptation for the LLM attention layers

Model or implementation: Mixture of Low-Rank Matrices

Base LLM

Generate the next item prediction

Model or implementation: Llama-2-7B

Novel Architectural Elements

Split-LoRA Architecture: Dividing standard LoRA matrices A and B into K sub-matrices (experts) to capture different latent behaviors.
Instance-Guided Gating: Using an external recommender's embedding (SASRec) to drive the gating function for an LLM adapter, rather than using the LLM's own internal states.

Modeling

Base Model: Llama-2-7B

Training Method: Supervised Fine-Tuning (Instruction Tuning) with iLoRA

Objective Functions:

Purpose: Maximize the likelihood of the correct next item token sequence.

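Formally: Autoregressive language modeling loss L = -Σ_t log P(y_t | y_<t, x; φ + Δφ(i_<n)), where φ denotes the frozen LLM parameters and Δφ(i_<n) the instance-specific iLoRA update for the sequence i_<n.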
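To make the objective concrete, here is a small sketch of the per-sequence loss computed from raw vocabulary logits. It assumes `step_logits[t]` already reflects the adapted model (base weights plus the instance-specific update) conditioned on the prompt and the previous target tokens. The function names are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of vocabulary logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def autoregressive_nll(step_logits, target_ids):
    """Negative log-likelihood of the target tokens, summed over positions t.

    step_logits[t] holds the logits used to predict target_ids[t].
    """
    return -sum(
        log_softmax(logits)[target]
        for logits, target in zip(step_logits, target_ids)
    )


# Toy example: 3 target tokens, vocabulary of 4, uniform logits.
step_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = autoregressive_nll(step_logits, [1, 3, 0])
```

With uniform logits every token has probability 1/4, so the loss is 3 · log 4; training pushes probability mass toward the tokens of the correct item description and lowers this value.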

Adaptation: Instance-wise LoRA (iLoRA)

Trainable Parameters: Only iLoRA parameters (experts + gating) and behavioral projector are trained; Base LLM is frozen.

Key Hyperparameters:

LoRA_rank_r: Not explicitly reported in the paper
Number_of_experts_K: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper

Compute: Keeps the trainable parameter count nearly identical to standard LoRA (relative increase below 1%).
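A back-of-the-envelope count shows why the overhead can stay this small. The sketch assumes that a total rank r is split evenly across K experts, so the experts together match standard LoRA exactly and only the gating network adds parameters. Since r, K, and the size of z are not reported, the values below (r = 8, K = 4, d_z = 64) are purely hypothetical. The 4096 dimension is the Llama-2-7B hidden size.

```python
def lora_params(d_in, d_out, rank):
    """Parameters of one standard LoRA module: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


def ilora_params(d_in, d_out, rank, num_experts, d_z):
    """Parameters of one split-LoRA module plus a linear gate with bias.

    Assumes the total rank is divided evenly among the experts.
    """
    if rank % num_experts != 0:
        raise ValueError("rank must be divisible by num_experts")
    expert_rank = rank // num_experts
    experts = num_experts * expert_rank * (d_in + d_out)
    gating = d_z * num_experts + num_experts  # projection weights + bias
    return experts + gating


# Hypothetical settings; only the 4096 hidden size comes from Llama-2-7B.
base = lora_params(4096, 4096, rank=8)
split = ilora_params(4096, 4096, rank=8, num_experts=4, d_z=64)
relative_increase = (split - base) / base
```

Under these assumed values the gate adds 260 parameters on top of 65,536, a relative increase of roughly 0.4%. If a single gate were shared across all adapted layers, the overhead would be smaller still.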

Comparison to Prior Work

vs. LLaRA: iLoRA uses dynamic, instance-specific LoRA weights via MoE instead of a single static LoRA module.
vs. TALLRec: iLoRA incorporates behavioral tokens and dynamic adaptation, whereas TALLRec uses static LoRA on text only.
vs. MoRec: MoRec replaces item IDs with text but uses standard BERT-like encoders, whereas iLoRA uses generative LLMs with MoE adapters [not cited in paper].

Limitations

Relies on a pre-trained sequential recommender (SASRec) for gating signals, introducing a dependency.
Inference cost may be slightly higher than standard LoRA because the gating computation and weight aggregation run once per instance, even though the parameter count is similar.

Reproducibility

Code and data: https://github.com/AkaliKong/iLoRA

The paper states that it keeps the same experimental settings as LLaRA [9].

📊 Experiments & Results

Evaluation Setup

Next-item prediction on sequential recommendation datasets.

Benchmarks:

LastFM (Music Artist Recommendation)
MovieLens (Movie Recommendation)
Steam (Game Recommendation)

Metrics:

Hit Ratio (HR)
NDCG
Statistical methodology: Not explicitly reported in the paper
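For reference, both metrics can be computed as follows in the next-item setting, where each test case has exactly one relevant item. This is a generic sketch of the standard definitions. The cutoff k and the candidate-ranking procedure used in the paper are not stated here, so they are left as parameters.

```python
import math


def hit_ratio_at_k(ranked_items, target, k):
    """1.0 if the target item appears in the top-k of the ranking, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with a single relevant item, so the ideal DCG equals 1."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Toy example: the true next item "b" is ranked second.
ranking = ["a", "b", "c"]
hr_at_1 = hit_ratio_at_k(ranking, "b", k=1)
hr_at_2 = hit_ratio_at_k(ranking, "b", k=2)
ndcg_at_3 = ndcg_at_k(ranking, "b", k=3)
```

Reported scores are the averages of these per-sequence values over the test set. HR only checks whether the target made the cut, while NDCG also rewards placing it nearer the top.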

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Average across 3 datasets	Hit Ratio (HR)	Not explicitly reported in the paper	Not explicitly reported in the paper	+11.4% relative (vs. basic LoRA)

Experiment Figures

Gradient similarity heatmap for LLaRA (standard LoRA) across different user sequences.
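A heatmap of this kind is typically built from pairwise cosine similarities between gradient vectors. The sketch below assumes each sequence (or cluster of sequences) has already been reduced to one flattened gradient vector. How the paper extracts and groups these gradients is not described here, so that part is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(gradients):
    """Pairwise cosine similarities; entry [i][j] compares gradients i and j."""
    return [[cosine(g, h) for h in gradients] for g in gradients]


# Toy gradients: the first and third point in opposite directions (a conflict).
gradients = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
heatmap = similarity_matrix(gradients)
```

Values near 1 mean two sequences pull the shared parameters in the same direction, values near 0 mean they are unrelated, and negative values indicate the conflicting updates associated with negative transfer.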

Main Takeaways

iLoRA consistently outperforms standard LoRA (LLaRA) and traditional baselines (SASRec, etc.) across all datasets.
The method effectively mitigates negative transfer by disentangling diverse user behaviors into expert sub-spaces.
The improvement is achieved with negligible parameter overhead, validating the efficiency of the MoE-LoRA design.

📚 Prerequisite Knowledge

Prerequisites

Low-Rank Adaptation (LoRA) for LLMs
Sequential Recommendation (SASRec, GRU4Rec)
Mixture of Experts (MoE)
Instruction Tuning

Key Terms

LoRA: Low-Rank Adaptation—a PEFT method that injects trainable low-rank matrices into transformer layers to approximate weight updates while freezing the base model.

PEFT: Parameter-Efficient Fine-Tuning—techniques to adapt large pre-trained models with minimal parameter updates.

MoE: Mixture of Experts—an architecture where different parts of the model (experts) are activated for different inputs.

Negative Transfer: A phenomenon in multi-task learning where training on one task degrades performance on another due to conflicting gradient updates.

SASRec: Self-Attentive Sequential Recommendation—a transformer-based model for sequential recommendation used here to generate guidance representations.

Hybrid Prompting: Combining text tokens (from LLM tokenizer) with behavioral tokens (learned item embeddings from a recommender) in the input prompt.

Gating Network: A mechanism that computes attention weights (probabilities) to determine how much each expert contributes to the final output.