Continual Low-Rank Adapters for LLM-based Generative Recommender Systems

📝 Paper Summary

Continual Learning, Generative Recommendation, Parameter-Efficient Fine-Tuning (PEFT)

PESO enables large language models to adapt to evolving user preferences in recommendation by maintaining a single LoRA adapter that is mathematically anchored to its previous state, preventing catastrophic forgetting without freezing outdated knowledge.

Core Problem

Standard continual learning methods (like cumulative LoRA) assume tasks are disjoint and aim to preserve all past knowledge, but in recommendation, user preferences evolve and old preferences (e.g., outgrown hobbies) can actively degrade performance if forcefully preserved.

Why it matters:

Real-world recommendation data arrives sequentially, making retraining from scratch inefficient
Existing cumulative methods from computer vision fail in recommendation because they entangle outdated preferences with relevant ones
Outdated preferences must sometimes be overwritten (plasticity) rather than preserved (stability) to capture current user interests accurately

Concrete Example: A user who previously watched action movies but shifted to romance will be recommended irrelevant action titles if the model rigidly preserves the old 'action' adapter. PESO allows the old preference to fade if recent data doesn't support it, while retaining stable long-term interests.

Key Novelty

Proximally Regularized Single Evolving LoRA (PESO)

Rejects the 'cumulative adapter' approach (stacking frozen modules) in favor of a single evolving adapter to avoid entangling outdated knowledge
Introduces a Softmax-KL proximal regularization term that acts as a 'soft anchor,' pulling the adapter towards its previous state only when new data doesn't strongly suggest a change
Proves, under a quadratic approximation of the loss, that this regularization provides data-aware, direction-wise guidance: parameters update along directions supported by new data and stay near their previous values along unsupported directions

Architecture

Conceptual comparison of PESO vs. Cumulative LoRA strategies. Shows PESO maintaining a single adapter v_t anchored to v_{t-1}, while Cumulative LoRA stacks frozen history.
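The difference between the two strategies can be sketched on a toy weight update. This is a minimal illustration, assuming rank-1 adapters on a 2x2 weight; the helper names (`cumulative_delta`, `single_evolving_delta`) and all numeric values are hypothetical and not taken from the paper or its code.

```python
# Toy comparison of how the effective weight delta is formed at stage t.
# Shapes and values are illustrative assumptions, not from the paper.

def matmul(B, A):
    """Multiply a (d x r) matrix by an (r x k) matrix given as nested lists."""
    return [[sum(B[i][j] * A[j][c] for j in range(len(A)))
             for c in range(len(A[0]))] for i in range(len(B))]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def cumulative_delta(frozen_adapters, new_adapter):
    """Cumulative LoRA: trainable B_t A_t plus every frozen past B_i A_i."""
    B_t, A_t = new_adapter
    delta = matmul(B_t, A_t)
    for B_i, A_i in frozen_adapters:
        delta = add(delta, matmul(B_i, A_i))
    return delta

def single_evolving_delta(adapter):
    """Single evolving LoRA: only the current B_t A_t contributes."""
    B_t, A_t = adapter
    return matmul(B_t, A_t)

old = ([[1.0], [0.0]], [[1.0, 0.0]])  # stage t-1 adapter as (B, A)
new = ([[0.0], [1.0]], [[0.0, 1.0]])  # stage t adapter as (B, A)

cum = cumulative_delta([old], new)    # old contribution stays in the sum
single = single_evolving_delta(new)   # old contribution has been overwritten
```

In the cumulative case the stage t-1 term remains in the effective weight no matter what the new data says, which is the entanglement the summary describes. In the single-adapter case the previous state only matters through initialization and whatever anchoring term is added to the loss.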

Evaluation Highlights

Demonstrates that Cumulative LoRA (summing past adapters) performs worse than a single evolving adapter on natural chronological splits, contradicting findings in computer vision
The proposed Softmax-KL proximal regularizer functions as a p-weighted variance penalty (the variance of the parameter change under p = softmax(v_{t-1})), preserving internal module structure better than standard L2 regularization
Analysis confirms that parameter inheritance (initializing from the previous stage) is critical for performance, while aggregating old frozen adapters hinders adaptation to evolving preferences

Breakthrough Assessment

7/10

Provides a strong theoretical correction to the uncritical transfer of computer vision continual learning techniques (Cumulative LoRA) to recommendation, identifying why they fail and proposing a mathematically grounded alternative.

⚙️ Technical Details

Problem Definition

Setting: Continual fine-tuning of a generative recommender on chronologically arriving data blocks D_2, ..., D_T, following initial training on a base block D_1

Inputs: User interaction sequence history (x_{u,1}, ..., x_{u,N_u})

Outputs: Next item token y (Semantic ID)

Pipeline Flow

Input Processing: Tokenize user history items into Semantic IDs
LLM Forward Pass: Process tokens through pretrained LLM with LoRA injected in attention/MLP layers
Regularization (Training only): Compute Softmax-KL divergence between current LoRA weights and frozen previous weights
Output: Generate next item token
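The input-processing step can be sketched as follows. This is an illustration only: the codebook contents in `SEMANTIC_IDS`, the item names, and the `<L{level}_{code}>` token format are hypothetical, since the summary does not specify how Semantic IDs are rendered as tokens.

```python
from typing import Dict, List, Tuple

# Hypothetical codebook assignments: item id -> tuple of code indices,
# one index per level of a hierarchical tokenizer.
SEMANTIC_IDS: Dict[str, Tuple[int, ...]] = {
    "guitar_strings": (3, 17, 2),
    "capo": (3, 17, 5),
    "midi_keyboard": (8, 1, 4),
}

def item_to_tokens(item_id: str) -> List[str]:
    """Render one item's Semantic ID as level-tagged tokens."""
    return [f"<L{level}_{code}>"
            for level, code in enumerate(SEMANTIC_IDS[item_id])]

def build_example(history: List[str]) -> Tuple[List[str], List[str]]:
    """Split a chronological history into (input tokens, target tokens)."""
    *context, target = history
    inputs = [tok for item in context for tok in item_to_tokens(item)]
    return inputs, item_to_tokens(target)

inputs, target = build_example(["guitar_strings", "capo", "midi_keyboard"])
```

Here the first two items become the model input and the last item's tokens become the generation target. Items that share leading codes (the two guitar accessories) share leading tokens, which is the usual motivation for hierarchical IDs.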

System Modules

Base LLM

Provide pretrained language understanding and sequence modeling capabilities

Model or implementation: Pretrained Transformer (frozen weights W_0)

PESO Adapter (LoRA)

Capture evolving user preferences via low-rank updates

Model or implementation: Single evolving LoRA (Matrices A_t, B_t)

Proximal Regularizer

Penalize deviations from the previous stage's knowledge structure

Model or implementation: Softmax-KL Divergence

Novel Architectural Elements

Softmax-KL Proximal Regularization applied directly to LoRA parameter matrices (treating flattened parameters as distributions) to preserve module-wise structure
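One way to see what this term penalizes is a standard second-order expansion. The sketch below assumes the flattened parameters of a module act as logits; the symbols p and δ are introduced here for illustration, and the paper's own derivation may be organized differently.

```latex
\begin{aligned}
p &= \operatorname{softmax}(v_{t-1}), \qquad \delta = v_t - v_{t-1},\\
D_{\mathrm{KL}}\big(\operatorname{softmax}(v_{t-1}) \,\|\, \operatorname{softmax}(v_t)\big)
  &= \log \sum_i p_i e^{\delta_i} - \sum_i p_i \delta_i\\
  &\approx \tfrac{1}{2}\Big[\sum_i p_i \delta_i^2 - \Big(\sum_i p_i \delta_i\Big)^2\Big]
   = \tfrac{1}{2}\operatorname{Var}_p(\delta).
\end{aligned}
```

Adding the same constant to every entry of δ leaves the variance unchanged, so the penalty reacts to changes in the relative structure of a module's parameters rather than to a uniform shift.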

Modeling

Base Model: LLM-based Recommender (Specific backbone not detailed in provided text)

Training Method: Continual Fine-tuning with Proximal Regularization

Objective Functions:

Purpose: Optimize for next-item prediction on current data.

Formally: L_{ce} = CrossEntropy(prediction, target)

Purpose: Anchor adapter to previous state to prevent forgetting.

Formally: L_{reg} = λ * D_{KL}(softmax(v_{t-1}) || softmax(v_t))

Purpose: Combined training objective.

Formally: L_t = L_{ce} + L_{reg}
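The objective above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, assuming `v_prev` and `v_curr` are the flattened parameters of a single LoRA module and that the cross-entropy value is computed elsewhere; the function names and the per-module granularity are assumptions, not the released implementation.

```python
import math
from typing import List

def log_softmax(v: List[float]) -> List[float]:
    """Numerically stable log-softmax over a flat parameter vector."""
    m = max(v)
    lse = m + math.log(sum(math.exp(x - m) for x in v))
    return [x - lse for x in v]

def softmax_kl(v_prev: List[float], v_curr: List[float]) -> float:
    """D_KL(softmax(v_prev) || softmax(v_curr)), with v_prev held fixed."""
    log_p = log_softmax(v_prev)
    log_q = log_softmax(v_curr)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def total_loss(ce_loss: float, v_prev: List[float],
               v_curr: List[float], lam: float) -> float:
    """L_t = L_ce + lam * D_KL(softmax(v_{t-1}) || softmax(v_t))."""
    return ce_loss + lam * softmax_kl(v_prev, v_curr)

v_prev = [0.5, -1.0, 2.0]
v_curr = [0.7, -1.0, 1.5]
loss = total_loss(ce_loss=1.0, v_prev=v_prev, v_curr=v_curr, lam=0.5)
```

When `v_curr` equals `v_prev` the regularizer is zero and the loss reduces to the cross-entropy term, so λ only matters once the adapter starts to move away from its previous state.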

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA matrices A and B only (Base LLM is frozen)

Training Data:

Amazon Review (Musical Instruments) dataset split into chronological blocks
Base data D_1 (60%) for pretraining, D_2...D_5 (10% each) for continual phases
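A chronological split of this shape can be sketched as below. The summary gives the block proportions but not whether they are measured by interaction count or by time span; this sketch assumes interaction count, and the record layout `(user, item, timestamp)` and the function name are hypothetical.

```python
def chronological_blocks(interactions, fractions=(0.6, 0.1, 0.1, 0.1, 0.1)):
    """Sort records by timestamp and cut them into consecutive blocks D_1..D_5."""
    ordered = sorted(interactions, key=lambda rec: rec[2])
    n = len(ordered)
    blocks, start, cumulative = [], 0, 0.0
    for i, frac in enumerate(fractions):
        cumulative += frac
        # The last block takes whatever remains, avoiding rounding gaps.
        end = n if i == len(fractions) - 1 else round(cumulative * n)
        blocks.append(ordered[start:end])
        start = end
    return blocks

# Synthetic records with distinct, shuffled timestamps (illustrative only).
interactions = [(f"u{i % 3}", f"item{i}", 1000 + (i * 7) % 20) for i in range(20)]
blocks = chronological_blocks(interactions)
```

Because the cut is made on the globally sorted stream, every record in a later block is newer than every record in an earlier one, which is what lets later stages reflect preference drift.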

Compute: Not reported in the paper

Comparison to Prior Work

vs. Single Evolving LoRA: PESO adds a proximal term to prevent catastrophic forgetting while still allowing adaptation, whereas plain Single Evolving LoRA overwrites past knowledge with no control.
vs. Cumulative LoRA: PESO uses a single adapter to allow 'unlearning' of outdated preferences, whereas Cumulative LoRA permanently bakes in old adapters, harming performance when preferences drift.
vs. L2-Regularized LoRA [implied baseline]: PESO uses Softmax-KL, which respects the internal structure and relative magnitudes of parameters (a p-weighted variance penalty), unlike L2, which treats all parameters uniformly.
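The contrast with L2 can be checked on a toy vector. This sketch assumes a three-element parameter vector with made-up values; it is not an experiment from the paper, only a numerical illustration of how the two penalties treat the same changes.

```python
import math

def log_softmax(v):
    """Numerically stable log-softmax over a flat parameter vector."""
    m = max(v)
    lse = m + math.log(sum(math.exp(x - m) for x in v))
    return [x - lse for x in v]

def softmax_kl(v_prev, v_curr):
    """D_KL(softmax(v_prev) || softmax(v_curr))."""
    log_p, log_q = log_softmax(v_prev), log_softmax(v_curr)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def l2_penalty(v_prev, v_curr):
    """Squared Euclidean distance between the two parameter vectors."""
    return sum((a - b) ** 2 for a, b in zip(v_prev, v_curr))

v_prev = [2.0, 0.0, -2.0]
shifted = [x + 0.5 for x in v_prev]   # uniform shift of every entry
bump_first = [2.3, 0.0, -2.0]         # +0.3 on the entry with most softmax mass
bump_last = [2.0, 0.0, -1.7]          # +0.3 on the entry with least softmax mass

kl_shift, l2_shift = softmax_kl(v_prev, shifted), l2_penalty(v_prev, shifted)
kl_first, kl_last = softmax_kl(v_prev, bump_first), softmax_kl(v_prev, bump_last)
l2_first, l2_last = l2_penalty(v_prev, bump_first), l2_penalty(v_prev, bump_last)
```

L2 charges the uniform shift and assigns the two single-entry bumps the same cost. Softmax-KL ignores the uniform shift and, in this example, charges the bump on the dominant entry more than the bump on the minor one.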

Limitations

Relies on a fixed item tokenizer (Semantic ID); does not address how to handle completely new items that don't fit the existing codebook
Theoretical analysis assumes a quadratic approximation of the loss landscape
Comparison is primarily against LoRA variants; full fine-tuning baselines are discussed as inefficient but not extensively compared in the provided text

Reproducibility

Code: https://github.com/hsyoo32/peso

The provided text details the exact mathematical formulation of the proximal term and the data splitting strategy (natural chronological vs. user-disjoint). Specific hyperparameters (learning rate, rank) and base LLM size are not in the provided excerpt.

📊 Experiments & Results

Evaluation Setup

Next-item prediction on Amazon Review (Musical Instruments) dataset

Benchmarks:

Amazon Review (Musical Instruments) (Sequential Recommendation)

Metrics:

Not explicitly reported in the provided text (likely Recall@K or NDCG@K given the domain)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

Cumulative LoRA (SumLoRA) performs significantly worse on Natural Chronological splits (evolving preferences) than on User-Disjoint splits, indicating its design is ill-suited for capturing preference evolution.
Parameter inheritance (initializing v_t from v_{t-1}) is essential for performance; summing frozen past adapters without inheritance leads to the worst performance.
Fixed-magnitude summation (SumLoRA) and learnable-magnitude summation (SD-LoRA) both struggle because frozen adapters entangle useful and outdated knowledge, making them hard to disentangle.
PESO's proximal design allows the model to balance stability and plasticity based on data support: where data support is strong (large curvature eigenvalues under the quadratic approximation of the loss), the model adapts; where it is weak, the model retains the previous state.
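The direction-wise behavior in the last takeaway can be illustrated with a simplified stand-in. This sketch replaces the Softmax-KL term with a plain quadratic proximal term and assumes the data loss is already diagonal in its eigenbasis, so it shows the stability-plasticity trade-off in general rather than PESO's exact update; `proximal_update` and all values are hypothetical.

```python
def proximal_update(eigenvalues, v_star, v_prev, lam):
    """Per-direction minimizer of
        0.5 * h_i * (v - v_star_i)**2 + 0.5 * lam * (v - v_prev_i)**2,
    where h_i is the curvature (data support) along direction i,
    v_star_i is the optimum suggested by new data, and v_prev_i is
    the previous-stage value. Requires h_i + lam > 0.
    """
    return [(h * s + lam * p) / (h + lam)
            for h, s, p in zip(eigenvalues, v_star, v_prev)]

# Direction 0 is strongly supported by new data, direction 1 not at all.
out = proximal_update(eigenvalues=[100.0, 0.0],
                      v_star=[1.0, 1.0],
                      v_prev=[0.0, 0.0],
                      lam=1.0)
```

With these numbers the strongly supported direction moves almost all the way to the new optimum (100/101), while the unsupported direction stays exactly at its previous value.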

📚 Prerequisite Knowledge

Prerequisites

Low-Rank Adaptation (LoRA)
Continual Learning (Stability-Plasticity Dilemma)
Generative Recommendation
Proximal Optimization
Kullback-Leibler (KL) Divergence

Key Terms

LoRA: Low-Rank Adaptation, a technique to fine-tune large models by training only small low-rank matrices while keeping the base model frozen

Semantic ID: A method of representing items as sequences of tokens derived from their semantic features (e.g., title/description), often using a hierarchical tokenizer

Proximal Regularizer: A penalty term in the loss function that keeps the new model parameters close to the previous version to prevent drastic changes (forgetting)

Cumulative LoRA: A family of methods that freeze past adapters and sum them with a new trainable adapter; popular in vision but shown here to be harmful for recommendation

Plasticity: The ability of the model to learn new patterns (e.g., a user's new interest in cooking)

Stability: The ability of the model to retain useful old patterns (e.g., a user's long-term interest in sci-fi)

Softmax-KL Proximal: A specific regularization term using KL divergence between the softmax outputs of the adapter weights, preserving the structural distribution of the weights