RecCocktail: A Generalizable and Efficient Framework for LLM-Based Recommendation

📝 Paper Summary

LLM-based Recommendation, Parameter-Efficient Fine-Tuning (PEFT), Domain Adaptation

RecCocktail merges a general-purpose recommendation LoRA module with a domain-specific LoRA module via linear weight arithmetic to achieve both generalization and domain adaptability without extra inference cost.

Core Problem

Current LLM-based recommenders typically focus on either breadth (generalization via multi-domain data) or depth (domain-specific tuning), failing to simultaneously handle new domains and maximize performance on specific ones.

Why it matters:

Breadth-oriented models often underperform in specific domains due to lack of deep alignment.
Depth-oriented models struggle with distribution shifts, cold-start scenarios, and new domains where training data is sparse.
Existing solutions like ensembling outputs increase inference latency, while sequential fine-tuning risks catastrophic forgetting.

Concrete Example: A model trained on general e-commerce data might understand shopping but fail to capture the specific nuances of 'MovieLens' user behavior. Conversely, a model fine-tuned only on MovieLens fails completely when transferred to a new 'Toys' domain without retraining.

Key Novelty

LoRA Cocktail (Weight Space Merging)

Treats LoRA adapters as 'task vectors' that can be linearly combined in weight space, merging a 'base spirit' (general knowledge) and an 'ingredient' (domain-specific knowledge).
Introduces an entropy-guided adaptive merging strategy that tunes the mixing coefficients at test time using unlabeled data to minimize prediction uncertainty.

Architecture

The three-stage framework of RecCocktail: (a) Preparing Base Spirit via general instruction tuning, (b) Preparing Ingredient via domain-specific tuning, and (c) Making Cocktail via entropy-guided weight merging.

Evaluation Highlights

Outperforms state-of-the-art LLM-based methods (TALLRec, AlphaRec) on MovieLens-1M (NDCG@1: 0.5783 vs 0.5392 for TALLRec, an absolute gain of 0.0391).
Achieves consistent gains across four datasets (Beauty, Toys, Sports, MovieLens), improving NDCG@1 by ~7-20% over strong baselines.
Demonstrates robust generalization: The general 'base spirit' module alone often outperforms zero-shot LLMs and some traditional methods even without domain-specific tuning.

Breakthrough Assessment

8/10

Elegantly solves the dilemma between generalization and specialization in LLM-Rec via simple weight arithmetic. The entropy-guided merging makes it adaptive without retraining, offering high practical value.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation, i.e., predicting the next item a user will interact with given that user's interaction history.

Inputs: User historical interaction sequence S_u and candidate item set.

Outputs: Ranked list or selection of the next likely item i_{u}^{L+1}.

Pipeline Flow

Instruction Dataset Construction (General & Specific)
Base Spirit Training (General LoRA)
Ingredient Training (Specific LoRA)
LoRA Cocktail Merging (Linear Arithmetic)
Entropy-Guided Adaptation (Test-time)

System Modules

General Instruction Construction

Aggregate multi-domain data into a uniform instruction format

Model or implementation: N/A (Data processing)

Base Spirit Tuner (Training)

Learn general recommendation knowledge

Model or implementation: Pre-trained LLM + LoRA

Ingredient Tuner (Training)

Learn domain-specific patterns

Model or implementation: Pre-trained LLM + LoRA

Cocktail Merger

Combine general and specific knowledge in weight space

Model or implementation: Linear Weight Arithmetic
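As a minimal sketch of what linear weight arithmetic over two LoRA modules could look like, the snippet below folds both low-rank updates into the frozen weight, so inference uses a single matrix. The function names, the toy 2x2 sizes, and the values of lam1 and lam2 are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of linear LoRA merging; all names and sizes are illustrative.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(base, delta, scale):
    """Return base + scale * delta, element-wise."""
    return [[b + scale * d for b, d in zip(br, dr)] for br, dr in zip(base, delta)]

def merge_lora(w0, general, specific, lam1, lam2):
    """Fold two LoRA updates into one matrix: W0 + lam1 * B_g A_g + lam2 * B_s A_s."""
    b_g, a_g = general
    b_s, a_s = specific
    merged = add_scaled(w0, matmul(b_g, a_g), lam1)
    return add_scaled(merged, matmul(b_s, a_s), lam2)

# Toy 2x2 frozen weight with rank-1 adapters (B is 2x1, A is 1x2).
w0 = [[1.0, 0.0], [0.0, 1.0]]
general = ([[1.0], [0.0]], [[0.0, 2.0]])   # B_g A_g = [[0, 2], [0, 0]]
specific = ([[0.0], [1.0]], [[4.0, 0.0]])  # B_s A_s = [[0, 0], [4, 0]]

w_merged = merge_lora(w0, general, specific, lam1=0.5, lam2=0.5)
print(w_merged)  # [[1.0, 1.0], [2.0, 1.0]]
```

Because the result has the same shape as the original weight, the merged model runs one forward pass per request, which is where the "no extra inference cost" property comes from.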

Novel Architectural Elements

LoRA Cocktail Operator: A linear combination mechanism ⊕ that merges two distinct LoRA modules (general and specific) into a single module.
Entropy-Guided Adaptive Merging: A test-time adaptation loop that updates scalar mixing coefficients (λ) based on the entropy of the model's output on unlabeled test data.
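The entropy-guided step can be sketched as choosing the coefficient that gives the lowest average prediction entropy on an unlabeled batch. The sketch below swaps in a plain grid search for whatever update rule the paper uses, and `toy_predict` is a hypothetical stand-in for a forward pass through the merged model; both are assumptions made to keep the example self-contained.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(lam, batch, predict):
    """Average entropy of the predictions for one coefficient value."""
    return sum(entropy(softmax(predict(x, lam))) for x in batch) / len(batch)

def pick_lambda(batch, predict, steps=20):
    """Grid search over lam1 in [0, 1]; lam2 is implied by lam1 + lam2 = 1."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda lam: mean_entropy(lam, batch, predict))

def toy_predict(x, lam):
    """Hypothetical merged model: logits over three candidate items."""
    general_logits = [0.2 * x, 0.1 * x, 0.0]
    specific_logits = [3.0 * x, 0.0, 0.0]
    return [lam * g + (1.0 - lam) * s for g, s in zip(general_logits, specific_logits)]

batch = [1.0, 2.0, 0.5]  # unlabeled test inputs (toy scalars)
lam1 = pick_lambda(batch, toy_predict)
lam2 = 1.0 - lam1
print(lam1, lam2, mean_entropy(lam1, batch, toy_predict))
```

No labels are used anywhere in the search, which is the point of the technique: only the model's own confidence on test inputs drives the choice of coefficients.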

Modeling

Base Model: Qwen2-7B (inferred from baseline comparison table mentioning Qwen2-7B-zeroshot)

Training Method: Instruction Tuning with LoRA

Objective Functions:

Purpose: Fine-tune the LLM to predict the next item.

Formally: Autoregressive language modeling loss L = -\sum_t \log P(y_t | y_{<t}, x).

Purpose: Optimize merging coefficients at test time.

Formally: Minimize Shannon entropy H(\hat{y}) = -\sum_c p(c) \log p(c) on unlabeled test samples.
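With explicit indices, the two objectives can be read roughly as follows. The symbols T (response length), C (number of candidate answers), B (the unlabeled test batch), and the averaging over B are notational assumptions added here for clarity; the constraint on the coefficients is the one listed under Key Hyperparameters.

```latex
% Training objective for each LoRA module (next-token loss on the response y)
\mathcal{L}_{\mathrm{LM}} = -\sum_{t=1}^{T} \log P\left(y_t \mid y_{<t}, x\right)

% Test-time uncertainty of one prediction over C candidates
H(\hat{y}) = -\sum_{c=1}^{C} p(c) \log p(c), \qquad 0 \le H(\hat{y}) \le \log C

% Coefficient selection on the unlabeled batch B
(\lambda_1^{*}, \lambda_2^{*}) = \arg\min_{\lambda_1 + \lambda_2 = 1} \;
  \frac{1}{|B|} \sum_{x \in B} H\left(\hat{y}(x; \lambda_1, \lambda_2)\right)
```

The bounds give a quick sanity check: a uniform distribution over C candidates reaches the maximum log C, while a one-hot prediction has entropy 0.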

Adaptation: LoRA (Low-Rank Adaptation)
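LoRA keeps the pre-trained weights frozen and trains only two low-rank factors per adapted matrix, which is why storing one general module plus one module per domain stays cheap. The short calculation below illustrates the saving; the 4096 x 4096 layer size and rank 8 are illustrative assumptions, not values reported for RecCocktail.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare a full d_out x d_in update with LoRA factors B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

# Illustrative sizes only: one square projection matrix with a rank-8 adapter.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{100.0 * lora / full:.2f}%")  # 16777216 65536 0.39%
```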

Training Data:

General dataset constructed from multiple domains (Beauty, Toys, Sports, MovieLens)
Specific datasets for each target domain
Instruction template includes task description, history, and candidate items
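To make the three template parts concrete, here is a minimal sketch of how one instruction sample might be assembled. The exact wording, the lettered candidate format, and the `build_instruction` helper are hypothetical; the paper's actual template may differ.

```python
def build_instruction(task_description, history, candidates):
    """Assemble one prompt from a task description, a history, and candidate items."""
    history_text = "\n".join(
        f"{i}. {title}" for i, title in enumerate(history, start=1)
    )
    candidate_text = "\n".join(
        f"({chr(ord('A') + i)}) {title}" for i, title in enumerate(candidates)
    )
    return (
        f"{task_description}\n\n"
        f"Interaction history:\n{history_text}\n\n"
        f"Candidate items:\n{candidate_text}\n\n"
        "Answer:"
    )

# Illustrative sample; the titles and the task wording are made up for the example.
sample = {
    "prompt": build_instruction(
        "Given the user's interaction history, pick the item they are most likely to interact with next.",
        ["Toy Story", "Finding Nemo", "Monsters, Inc."],
        ["The Incredibles", "Heat", "Alien"],
    ),
    "response": "The Incredibles",
}
print(sample["prompt"])
```

Using the same format for the general and the domain-specific datasets keeps the two LoRA modules trained on the same input distribution, which plausibly helps when their weights are later combined.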

Key Hyperparameters:

merging_coefficients: λ1 + λ2 = 1 (constraint)

Compute: Not reported in the paper

Comparison to Prior Work

vs. TALLRec: TALLRec only fine-tunes on target data; RecCocktail merges general pre-training with target tuning.
vs. P5: P5 relies on multi-task data merging at input level; RecCocktail merges knowledge in parameter space.
vs. Model Soups [not cited in paper]: Model Soups averages the weights of models trained with different hyperparameters on the same task; RecCocktail merges modules trained on different domains or tasks.

Limitations

Relies on the assumption that task vectors are additive, which may not hold for all architecture types or divergent tasks.
Requires fine-tuning a separate LoRA for each specific domain (though the general module is reused).
Entropy minimization requires a batch of unlabeled test data at inference time to tune coefficients.

Reproducibility

Code: https://anonymous.4open.science/r/RecCocktail

The paper details the merging formula and the entropy minimization strategy. Specific hyperparameters such as learning rate and batch size are not explicitly detailed.

📊 Experiments & Results

Evaluation Setup

Next-item prediction (ranking) on sequential recommendation datasets.

Benchmarks:

Beauty (Sequential Recommendation)
Toys (Sequential Recommendation)
Sports (Sequential Recommendation)
MovieLens-1M (Sequential Recommendation)

Metrics:

NDCG@1
NDCG@3

Statistical methodology: Not explicitly reported in the paper
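For next-item prediction there is a single relevant item per test case, so NDCG@k reduces to a position discount on that item's rank. The sketch below assumes this single-target setting and uses made-up item IDs.

```python
import math

def ndcg_at_k(ranked_items, target, k):
    """NDCG@k when exactly one item (the true next item) is relevant.

    The ideal DCG is 1, so the score is just the discount at the target's rank,
    or 0 if the target is not in the top k.
    """
    for position, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(position + 1)
    return 0.0

ranking = ["item_b", "item_a", "item_c"]  # model's ranked candidates (illustrative)
print(ndcg_at_k(ranking, "item_b", 1))  # 1.0
print(ndcg_at_k(ranking, "item_a", 1))  # 0.0
print(ndcg_at_k(ranking, "item_a", 3))  # 1 / log2(3), about 0.6309
```

In this setting NDCG@1 is simply top-1 accuracy, while NDCG@3 gives partial credit when the true item is ranked second or third.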

Key Results

Performance comparison on sequential recommendation datasets (NDCG@1). RecCocktail consistently outperforms both breadth-oriented (P5) and depth-oriented (TALLRec) baselines.

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
Beauty	NDCG@1	0.3347	0.4132	+0.0785
Toys	NDCG@1	0.3746	0.4097	+0.0351
Sports	NDCG@1	0.3585	0.3754	+0.0169
MovieLens-1M	NDCG@1	0.5392	0.5783	+0.0391

Generalization capability testing using NDCG@3. The general module (RecCocktail-G) shows strong zero-shot performance compared to raw LLMs.

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
Beauty	NDCG@3	0.0260	0.2072	+0.1812
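The Δ column is an absolute difference. A short script like the one below recomputes it from the reported scores and adds the relative gain, which is often easier to compare across datasets. The numbers are copied from the rows above; the `gains` helper is just an illustration.

```python
# (benchmark, metric, baseline score, RecCocktail score), copied from the table.
rows = [
    ("Beauty", "NDCG@1", 0.3347, 0.4132),
    ("Toys", "NDCG@1", 0.3746, 0.4097),
    ("Sports", "NDCG@1", 0.3585, 0.3754),
    ("MovieLens-1M", "NDCG@1", 0.5392, 0.5783),
    ("Beauty", "NDCG@3", 0.0260, 0.2072),
]

def gains(baseline, ours):
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = ours - baseline
    return round(absolute, 4), round(100.0 * absolute / baseline, 1)

for name, metric, base, ours in rows:
    absolute, relative = gains(base, ours)
    print(f"{name:<13}{metric:<8}{absolute:+.4f}  {relative:+.1f}%")
```

For the NDCG@1 rows listed here, the relative gains work out to roughly 5% (Sports) up to about 23% (Beauty).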

Experiment Figures

A conceptual illustration of 'Task Vectors' in weight space.

Main Takeaways

RecCocktail consistently outperforms single-paradigm methods (breadth-only or depth-only) across all datasets.
The 'base spirit' (general LoRA) provides a strong initialization that enhances performance even before domain-specific tuning.
Weight space merging is effective and efficient, introducing no additional inference latency unlike ensemble methods.
The method is robust to cold-start scenarios where traditional methods like SASRec often struggle.

📚 Prerequisite Knowledge

Prerequisites

Low-Rank Adaptation (LoRA) for LLMs
Sequential Recommendation
Instruction Tuning

Key Terms

LoRA: Low-Rank Adaptation, a technique to fine-tune LLMs by injecting trainable low-rank matrices while freezing the main weights.

Base Spirit: The domain-general LoRA module fine-tuned on a large-scale dataset aggregated from multiple recommendation domains.

Ingredient: The domain-specific LoRA module fine-tuned on data tailored to a specific target domain.

Entropy Minimization: An optimization objective used here to adjust merging coefficients at test time, favoring model configurations that produce confident (low-uncertainty) predictions.

NDCG: Normalized Discounted Cumulative Gain, a measure of ranking quality that accounts for the position of relevant items.

Task Vector: The concept that the difference between fine-tuned and pre-trained weights represents a direction in parameter space encoding specific task capabilities.
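In symbols, the task-vector view and the merge it enables can be written as below. The subscripts and the omission of any LoRA scaling factor are notational assumptions made for readability.

```latex
% A task vector is the weight change produced by fine-tuning
\tau = \theta_{\mathrm{ft}} - \theta_{\mathrm{pre}}

% For a LoRA-adapted layer the change is the low-rank product
\tau = BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

% Merging the general and domain-specific task vectors
\theta_{\mathrm{merged}} = \theta_{\mathrm{pre}}
  + \lambda_1 \tau_{\mathrm{general}} + \lambda_2 \tau_{\mathrm{specific}},
  \qquad \lambda_1 + \lambda_2 = 1
```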