CALRec: Contrastive Alignment of Generative LLMs for Sequential Recommendation

📝 Paper Summary

Sequential Recommendation, Generative Recommendation

CALRec adapts generative LLMs for sequential recommendation using a two-tower contrastive alignment objective alongside next-item generation, refined through a two-stage fine-tuning process.

Core Problem

Pretrained LLMs lack specific understanding of user sequential behavior and domain-specific item attributes, while standard next-token prediction losses fail to capture high-level user-item alignment.

Why it matters:

Traditional ID-based recommenders struggle with dynamic user interests and lack semantic understanding of item content
Directly applying LLMs to recommendation often yields suboptimal results without domain adaptation or structural alignment
Pure text-based recommendation avoids the cold-start issues associated with fixed ID embeddings

Concrete Example: A user buys a hammer, then nails. A standard LLM might simply predict a generic text continuation. CALRec uses a contrastive loss to align the embedding of the user's history (hammer, nails) with the embedding of the target item (wood glue), so that the generated text describes the relevant next item.

Key Novelty

Contrastive Aligned Generative LLM Recommendation (CALRec)

Combines standard next-token generation loss with auxiliary contrastive losses (InfoNCE) that align user history representations with target item representations in a shared latent space
Implements a two-stage fine-tuning strategy: first joint training across multiple categories for general patterns, then category-specific refinement
Uses a 'quasi-round-robin' BM25 retrieval mechanism to map generated text descriptions back to specific items in the corpus

Architecture

The overall framework of CALRec, illustrating the prompt structure, the two-tower contrastive alignment, and the training objectives.

Evaluation Highlights

+37% improvement in Recall@1 compared to state-of-the-art baselines on Amazon Review datasets
+24% improvement in NDCG@10 compared to state-of-the-art baselines on Amazon Review datasets
Outperforms both traditional sequential models (SASRec) and LLM-based approaches (GPT4Rec) across five domain categories

Breakthrough Assessment

7/10

Strong empirical gains and a sensible integration of contrastive learning with generative LLMs. While the components (contrastive loss, two-stage tuning) are known, their specific application to text-based sequential RecSys is effective.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation as a sequence-to-sequence generation task using pure text

Inputs: Text description of user interaction history I_{1:n-1}

Outputs: Text description of the target item I_n

Pipeline Flow

Input Processing (Prompt Construction)
LLM Encoding & Generation
Contrastive Alignment (Training only)
Retrieval & Ranking (Inference only)

System Modules

Prompt Constructor

Flattens item attributes into text and adds structural prefixes (e.g., 'The next item bought is...')

Model or implementation: Rule-based template
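A minimal sketch of what such a rule-based template could look like. Only the 'The next item bought is' prefix comes from the summary. The history header, the per-item prefix, the attribute names, and the function names are all hypothetical and chosen for illustration.

```python
from typing import Dict, List

# Only TARGET_PREFIX is taken from the summary; the other prefixes are hypothetical.
HISTORY_PREFIX = "Here is the purchase history of a user."
ITEM_PREFIX = "Item {idx}:"
TARGET_PREFIX = "The next item bought is"


def flatten_item(attrs: Dict[str, str]) -> str:
    """Turn an attribute dict into 'Key: value' text, skipping empty values."""
    return ", ".join(f"{k.capitalize()}: {v}" for k, v in attrs.items() if v)


def build_prompt(history: List[Dict[str, str]]) -> str:
    """Concatenate flattened history items and end with the target prefix."""
    lines = [HISTORY_PREFIX]
    for idx, item in enumerate(history, start=1):
        lines.append(f"{ITEM_PREFIX.format(idx=idx)} {flatten_item(item)}")
    lines.append(TARGET_PREFIX)
    return "\n".join(lines)


history = [
    {"title": "Claw hammer", "category": "Tools", "brand": ""},
    {"title": "Box of nails", "category": "Tools"},
]
prompt = build_prompt(history)
print(prompt)
```

Ending the prompt with the target prefix means the model's continuation is the text of the predicted next item, which is what the retrieval step later consumes.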

CALRec Backbone

Generates next item text and produces embeddings for alignment

Model or implementation: Llama-2-7b-chat (LoRA adapted)

Retrieval System

Maps generated text to concrete items in the corpus

Model or implementation: BM25 with Quasi-Round-Robin modulation

Novel Architectural Elements

Integration of a two-tower contrastive alignment head directly into the decoder-only LLM fine-tuning process
Quasi-round-robin BM25 retrieval mechanism that modulates text-matching scores with LLM generation probabilities
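The retrieval idea can be illustrated with a small, self-contained sketch. The summary does not give the exact formula by which generation probabilities modulate the BM25 scores, so the code below makes a simplifying assumption: generated candidates are ordered by sequence log-probability and their BM25 rankings are interleaved in turn. The `BM25` class, the `quasi_round_robin` function, the `k1` and `b` defaults, the toy corpus, and the log-probabilities are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 over a small in-memory corpus."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.n_docs = len(docs)
        self.avgdl = sum(self.lens) / self.n_docs
        df = Counter(t for tf in self.tfs for t in tf)
        self.idf: Dict[str, float] = {
            t: math.log((self.n_docs - c + 0.5) / (c + 0.5) + 1.0)
            for t, c in df.items()
        }

    def scores(self, query: str) -> List[float]:
        """BM25 score of every document for the given query text."""
        out = []
        for tf, dl in zip(self.tfs, self.lens):
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            out.append(sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in tokenize(query) if t in tf
            ))
        return out


def quasi_round_robin(bm25: BM25, generations: List[Tuple[str, float]],
                      top_k: int) -> List[int]:
    """Interleave per-generation BM25 rankings, most likely generation first."""
    ordered = sorted(generations, key=lambda g: g[1], reverse=True)
    rankings = []
    for text, _ in ordered:
        s = bm25.scores(text)
        rankings.append(sorted(range(len(s)), key=lambda i: s[i], reverse=True))
    picked: List[int] = []
    for depth in range(bm25.n_docs):
        for ranking in rankings:
            if ranking[depth] not in picked:  # skip items already recommended
                picked.append(ranking[depth])
            if len(picked) == top_k:
                return picked
    return picked


corpus = ["wood glue strong bond", "steel claw hammer",
          "box of nails 2 inch", "cat toy mouse"]
generations = [("nails 2 inch", -2.5), ("wood glue", -1.2)]  # (text, log-prob)
ranked = quasi_round_robin(BM25(corpus), generations, top_k=2)
print(ranked)
```

Taking turns across generations lets the best match of each sampled output reach the top of the list, instead of one long generated description dominating purely through a higher BM25 score.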

Modeling

Base Model: Llama-2-7b-chat

Training Method: Supervised Fine-Tuning with auxiliary Contrastive Loss

Objective Functions:

Next-item generation loss (L_NIG)

Purpose: Learn to generate the text of the next item.

Formally: L_NIG = -\sum_i \log p(t_i | t_{<i})

Contrastive alignment loss (L_cl)

Purpose: Align user history representation with target item representation.

Formally: L_cl = L_{user-item} + L_{item-user} (InfoNCE loss on mean-pooled embeddings)
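The contrastive term can be sketched in plain Python to make the two directions explicit. This is an illustration under stated assumptions, not the paper's implementation: cosine similarity as the scoring function and in-batch negatives are assumptions, and the function names and toy vectors are hypothetical. The default temperature of 1.0 mirrors the value listed under Key Hyperparameters.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector]) -> Vector:
    """Average token-level vectors into one sequence-level vector."""
    n = len(token_embeddings)
    return [sum(col) / n for col in zip(*token_embeddings)]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchors: List[Vector], candidates: List[Vector],
             temperature: float = 1.0) -> float:
    """Row i of anchors is paired with row i of candidates; other rows are negatives."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, c) / temperature for c in candidates]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def contrastive_loss(users: List[Vector], items: List[Vector],
                     temperature: float = 1.0) -> float:
    """L_cl = L_{user-item} + L_{item-user}."""
    return info_nce(users, items, temperature) + info_nce(items, users, temperature)


users = [mean_pool([[1.0, 0.0], [0.8, 0.2]]), mean_pool([[0.0, 1.0], [0.1, 0.9]])]
items = [[1.0, 0.1], [0.0, 1.0]]
aligned = contrastive_loss(users, items)
mismatched = contrastive_loss(users, items[::-1])
print(aligned, mismatched)
```

The loss is lower when each user vector sits closest to its own target item, which is the alignment the objective rewards.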

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA parameters (r=8, alpha=16, dropout=0.05)
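For readers unfamiliar with what these three values control, here is a toy sketch of the standard LoRA forward pass, h = W x + (alpha / r) B A x. The layer sizes and random weights are hypothetical, dropout is omitted for brevity, and nothing here reflects which Llama-2 modules the authors adapted, since the summary does not say.

```python
import random
from typing import List

Matrix = List[List[float]]

R, ALPHA, DROPOUT = 8, 16, 0.05  # values reported in the summary
SCALING = ALPHA / R              # standard LoRA scaling factor


def matvec(m: Matrix, x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: List[float]) -> List[float]:
    """W stays frozen; only the low-rank factors A (r x d_in) and B (d_out x r) train."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [h + SCALING * u for h, u in zip(base, update)]


rng = random.Random(0)
d_in, d_out = 4, 3  # toy sizes, hypothetical
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(R)]
b = [[0.0] * R for _ in range(d_out)]  # B starts at zero, so the update starts at zero
x = [1.0, 2.0, 3.0, 4.0]

h0 = lora_forward(w, a, b, x)
trainable = R * (d_in + d_out)  # parameters in A and B for this one layer
print(SCALING, trainable)
```

With r=8 and alpha=16 the low-rank update is scaled by 2, and because B is initialised to zero the adapted layer initially reproduces the frozen model exactly.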

Training Data:

Amazon Review Datasets (Beauty, Toys, Tools, Pet, Movie)
Stage 1: Sampled mixture proportional to |U|^0.3
Stage 2: Domain-specific data
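The Stage 1 mixture rule can be made concrete with a short sketch. The user counts below are made-up round numbers for illustration only, not the dataset's statistics, and `mixture_weights` is a hypothetical helper name.

```python
from typing import Dict

# Hypothetical user counts per category; real values come from the dataset statistics.
num_users = {
    "Beauty": 10_000,
    "Toys": 20_000,
    "Tools": 40_000,
    "Pet": 80_000,
    "Movie": 160_000,
}


def mixture_weights(users_per_category: Dict[str, int],
                    exponent: float = 0.3) -> Dict[str, float]:
    """Sampling probability of each category, proportional to |U| ** exponent."""
    raw = {c: n ** exponent for c, n in users_per_category.items()}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}


weights = mixture_weights(num_users)
for category, p in weights.items():
    print(f"{category}: {p:.3f}")
```

An exponent below 1 flattens the distribution: in this toy setup the largest category has 16 times as many users as the smallest but is sampled only about 2.3 times as often, so small categories are not drowned out during joint training.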

Key Hyperparameters:

learning_rate: 3e-4 (Stage 1), 1e-4 (Stage 2)
batch_size: 128
epochs: 1 (Stage 1), 5 (Stage 2)
context_length: 1024
contrastive_temperature: 1.0

Compute: 8 NVIDIA A100 (80GB) GPUs

Comparison to Prior Work

vs. SASRec: Uses pure text instead of IDs; handles cold-start better
vs. GPT4Rec: Uses Llama-2 vs GPT-2; adds contrastive alignment; uses two-stage fine-tuning
vs. TALLRec: Generates item text directly rather than classifying preference [not cited in paper]

Limitations

Inference latency is high due to decoding long text sequences
Performance depends on the quality of textual attributes in the dataset
Contrastive alignment adds computational overhead during training
Requires mapping generated text back to items, which can introduce retrieval errors

Reproducibility

The paper does not mention a code release. The dataset (Amazon Review) is public. LoRA and training hyperparameters are detailed, and the prompts are described in text.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation (next-item prediction) on Amazon Review datasets

Benchmarks:

Amazon Beauty (Sequential Recommendation)
Amazon Toys (Sequential Recommendation)
Amazon Tools (Sequential Recommendation)
Amazon Pet (Sequential Recommendation)
Amazon Movie (Sequential Recommendation)

Metrics:

Recall@1
Recall@5
Recall@10
NDCG@5
NDCG@10

Statistical methodology: Not explicitly reported in the paper
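For reference, the two metric families can be computed as follows. The sketch assumes a single ground-truth next item per user, the usual setup for next-item prediction, in which case the ideal DCG is 1. The function names and the toy ranking are illustrative.

```python
import math
from typing import List


def recall_at_k(ranked: List[str], target: str, k: int) -> float:
    """1.0 if the target appears in the top-k recommendations, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: List[str], target: str, k: int) -> float:
    """With one relevant item, NDCG reduces to 1 / log2(rank + 1)."""
    if target in ranked[:k]:
        rank = ranked.index(target) + 1  # 1-based position of the target
        return 1.0 / math.log2(rank + 1)
    return 0.0


ranked = ["wood glue", "hammer", "nails", "saw", "drill"]
target = "nails"
print(recall_at_k(ranked, target, 1))   # 0.0
print(recall_at_k(ranked, target, 5))   # 1.0
print(ndcg_at_k(ranked, target, 5))     # 0.5
```

Reported scores are these per-user values averaged over all test users. Recall@K only checks whether the target is present, while NDCG@K also rewards placing it higher in the list.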

Key Results

CALRec significantly outperforms baselines across all categories on Recall@1.

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty	Recall@1	0.0381	0.0628	+0.0247
Amazon Toys	Recall@1	0.0583	0.0763	+0.0180

CALRec shows consistent improvements in NDCG@10, indicating better ranking quality.

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty	NDCG@10	0.0635	0.0886	+0.0251
Amazon Toys	NDCG@10	0.0863	0.1030	+0.0167

Ablation studies confirm the necessity of both contrastive alignment and two-stage fine-tuning.

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty	Recall@5	0.0831	0.0889	+0.0058
Amazon Beauty	Recall@5	0.0811	0.0889	+0.0078
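The Δ column reports absolute differences, whereas the headline figures in Evaluation Highlights are relative gains. The short calculation below converts the Recall@1 and NDCG@10 rows to relative terms. Only the two categories listed in this summary are included, so these numbers need not match headline averages that presumably cover more categories.

```python
rows = [
    ("Amazon Beauty", "Recall@1", 0.0381, 0.0628),
    ("Amazon Toys", "Recall@1", 0.0583, 0.0763),
    ("Amazon Beauty", "NDCG@10", 0.0635, 0.0886),
    ("Amazon Toys", "NDCG@10", 0.0863, 0.1030),
]

relative = {}
for benchmark, metric, baseline, ours in rows:
    delta = ours - baseline                 # absolute gain, as in the Δ column
    relative[(benchmark, metric)] = delta / baseline
    print(f"{benchmark} {metric}: +{delta:.4f} absolute, "
          f"+{100 * delta / baseline:.1f}% relative")
```

On these rows the relative gains come out to roughly 64.8% and 30.9% for Recall@1, and 39.5% and 19.4% for NDCG@10.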

Main Takeaways

CALRec achieves substantial improvements over ID-based (SASRec) and text-based (GPT4Rec) baselines, particularly in Recall@1.
Two-stage fine-tuning (Joint + Category-Specific) is critical for performance, effectively transferring knowledge across domains.
The auxiliary contrastive loss aligns user and item representations in the latent space, boosting retrieval performance beyond simple next-token prediction.
Text-based repetition (Last Item Repeater) is a strong baseline in some datasets, highlighting the need for careful deduplication in sequential recommendation evaluation.
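The last takeaway can be illustrated with a toy sketch of a repeat-the-last-item baseline and a simple deduplication step. This is a simplification that compares item IDs directly; the summary describes the baseline as text-based, and the exact procedure used in the paper is not given here. All names and data are hypothetical.

```python
from typing import List, Sequence, Tuple


def last_item_repeater(history: Sequence[str]) -> str:
    """Predict that the next item equals the most recent one."""
    return history[-1]


def repeater_recall_at_1(examples: List[Tuple[List[str], str]]) -> float:
    """Fraction of (history, target) pairs where repeating the last item is correct."""
    hits = sum(last_item_repeater(h) == target for h, target in examples)
    return hits / len(examples)


def drop_consecutive_repeats(sequence: List[str]) -> List[str]:
    """Collapse runs of the same item so a trivial repeat cannot score a hit."""
    out: List[str] = []
    for item in sequence:
        if not out or out[-1] != item:
            out.append(item)
    return out


examples = [(["a", "b"], "b"), (["c", "d"], "e"), (["f"], "f"), (["g", "h"], "i")]
score = repeater_recall_at_1(examples)
cleaned = drop_consecutive_repeats(["a", "b", "b", "c", "c", "c"])
print(score, cleaned)
```

If a large share of test targets simply repeat the previous interaction, a model can look strong without learning anything sequential, which is why deduplicating before evaluation matters.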

📚 Prerequisite Knowledge

Prerequisites

Sequential Recommendation
Contrastive Learning (InfoNCE)
Language Modeling (Next Token Prediction)
BM25 Retrieval

Key Terms

CALRec: Contrastive Aligned Generative LLM Recommendation—the proposed framework combining generative and contrastive objectives

NIG: Next-Item Generation—the standard autoregressive language modeling objective applied to generating item descriptions

InfoNCE: Information Noise Contrastive Estimation—a loss function used to learn representations by pulling positive pairs together and pushing negative pairs apart

BM25: Best Matching 25—a ranking function used in information retrieval to estimate the relevance of documents to a search query

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items

Recall@K: The proportion of relevant items found in the top-K recommendations

Two-tower framework: A neural network architecture with separate encoders (towers) for user and item, used here for computing contrastive alignment

Llama-2-7b-chat: The specific open-source Large Language Model used as the backbone for CALRec

SASRec: Self-Attentive Sequential Recommendation—a baseline model using self-attention to model user history

GPT4Rec: A generative sequential recommendation baseline based on GPT-2