RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation

📝 Paper Summary

LLM-based recommendation · Generative recommendation · Zero-shot recommendation

RecBase is a foundation model pretrained from scratch on cross-domain recommendation data using a unified hierarchical item tokenizer to enable effective zero-shot transfer.

Core Problem

Existing LLM-based recommenders rely on language-centric pretraining that struggles to capture item-level sequential patterns, while ID-based models fail to generalize across domains due to disjoint vocabularies.

Why it matters:

The gap between language-modeling and recommendation objectives limits how well standard LLMs can model item-to-item relationships.
Traditional ID-based recommenders cannot handle zero-shot scenarios or new domains because item IDs are not transferable.
Mapping recommendation data directly to natural language is often verbose and may not effectively represent user behavioral sequences.

Concrete Example: When a standard LLM is asked to predict the next item for a user who bought a specific sequence of products, it often hallucinates items or falls back on generally popular ones rather than personalized ones, because it has not learned collaborative signals from interaction data. RecBase, by pretraining on 35M interactions across 15 domains, learns these sequential patterns directly.

Key Novelty

Curriculum Learning Enhanced RQ-VAE (CL-VAE) for Unified Item Tokenization

Standardizes item representations across domains by converting textual descriptions into hierarchical, discrete concept IDs using a shared encoder.
Uses curriculum learning to progressively train the quantization codebooks from coarse to fine, preventing codebook collapse and ensuring better utilization of the token space.
Pretrains an autoregressive Transformer on these discrete concept ID sequences across diverse domains to learn universal recommendation patterns.
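The coarse-to-fine idea can be illustrated with a very small schedule function. This is only a sketch: the summary does not say how stages are defined, so the fixed-length stages, the `steps_per_stage` parameter, and the `active_levels` helper are all hypothetical.

```python
def active_levels(step, steps_per_stage, num_levels=4):
    """Number of codebook levels being trained at a given step (coarse levels first).

    Assumption: one extra level is unlocked after every `steps_per_stage` steps.
    The paper's actual curriculum may differ.
    """
    if step < 0 or steps_per_stage <= 0:
        raise ValueError("step must be >= 0 and steps_per_stage must be > 0")
    return min(num_levels, step // steps_per_stage + 1)


# Sample the schedule at a few training steps.
schedule = [active_levels(step, steps_per_stage=1000)
            for step in (0, 999, 1000, 2500, 3000, 10_000)]
print(schedule)
```

Under this assumed schedule, only the coarsest codebook receives updates at first, and finer levels join once the earlier ones have had time to settle.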

Architecture

Overview of the RecBase framework, illustrating the two-stage process: (1) Item Tokenization via CL-VAE and (2) Autoregressive Pretraining.

Evaluation Highlights

RecBase-1.5B outperforms Llama-3-8B and Qwen-2-7B on zero-shot ranking tasks across 8 unseen datasets (e.g., +0.0474 AUC on H&M, +0.0241 AUC on Steam).
RecBase-0.3B (313M parameters) surpasses language models such as OPT-1.3B (a larger model) and BERT-base in zero-shot performance while remaining compact.
Fine-tuning yields further gains: +17.2% AUC improvement on Steam and +8.6% on MovieLens compared to zero-shot performance.

Breakthrough Assessment

8/10

Strong contribution in creating a true 'foundation model' for recommendation (trained from scratch on recommendation data) rather than just adapting an LLM. The unified tokenizer is a significant technical enabler for cross-domain transfer.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation treated as a next-token prediction task over discrete item concept IDs.

Inputs: User's historical interaction sequence S = (s_1, s_2, ..., s_n), where each s_i is a sequence of discrete concept IDs derived from item text.

Outputs: Predicted concept IDs for the next item s_{n+1}.
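To make the input and output format concrete, the sketch below flattens a history of items, each given as a 4-level concept ID, into a single token sequence and builds next-token training pairs. The per-level offset layout in `to_token_id` is an assumption made for illustration; the actual token layout (including any special tokens behind the reported vocabulary of 20,000) is not described in this summary.

```python
CODEBOOK_LEVELS = 4      # from the summary
CODEBOOK_SIZE = 2048     # codes per level, from the summary


def to_token_id(level, code):
    # Assumed layout: each level owns its own contiguous block of token IDs.
    if not (0 <= level < CODEBOOK_LEVELS and 0 <= code < CODEBOOK_SIZE):
        raise ValueError("level or code out of range")
    return level * CODEBOOK_SIZE + code


def flatten_history(history):
    """Turn [(c0, c1, c2, c3), ...] into one flat token sequence."""
    tokens = []
    for concept_id in history:
        if len(concept_id) != CODEBOOK_LEVELS:
            raise ValueError("each item needs one code per level")
        tokens.extend(to_token_id(level, code)
                      for level, code in enumerate(concept_id))
    return tokens


def next_token_pairs(tokens):
    """(prefix, target) pairs for autoregressive training."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


history = [(5, 17, 2047, 0), (5, 300, 12, 9)]   # two toy items
tokens = flatten_history(history)
pairs = next_token_pairs(tokens)
print(tokens)
```

With this layout, predicting the next item s_{n+1} amounts to generating four more tokens, one per codebook level.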

Pipeline Flow

Text Encoder (NV-Embed-v2) → Dense Embeddings
CL-VAE Quantizer → Discrete Concept IDs (Tokenization)
Autoregressive Transformer → Next Concept ID Prediction

System Modules

Item Text Encoder

Convert unstructured item text descriptions into dense semantic embeddings

Model or implementation: NV-Embed-v2 (frozen)

CL-VAE Tokenizer

Discretize dense embeddings into a sequence of hierarchical tokens (Concept IDs)

Model or implementation: Residual-quantized VAE (RQ-VAE) trained with curriculum learning; 4 levels, 2048 codes per level
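The discretization step can be sketched as greedy residual quantization, the general mechanism behind RQ-VAE-style tokenizers. This is a simplified illustration, not the paper's implementation: the codebooks here are random rather than learned, the embedding dimension is a toy value, and the encoder and decoder networks that a VAE would wrap around the quantizer are omitted.

```python
import numpy as np


def residual_quantize(embedding, codebooks):
    """Greedy residual quantization: one code index per level, coarse to fine."""
    residual = np.asarray(embedding, dtype=np.float64)
    codes = []
    for codebook in codebooks:  # each codebook has shape (codebook_size, dim)
        # Pick the code vector nearest to the current residual (squared L2).
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        # The next level only has to explain what this level missed.
        residual = residual - codebook[index]
    return codes, residual


def reconstruct(codes, codebooks):
    """Sum the selected code vectors to approximate the original embedding."""
    return sum(codebook[i] for codebook, i in zip(codebooks, codes))


rng = np.random.default_rng(0)
dim, levels, codebook_size = 16, 4, 2048  # levels and size follow the summary; dim is a toy value
# Hypothetical random codebooks with shrinking scale; a trained tokenizer would learn these.
codebooks = [rng.normal(scale=0.5 ** level, size=(codebook_size, dim))
             for level in range(levels)]
item_embedding = rng.normal(size=dim)
concept_id, final_residual = residual_quantize(item_embedding, codebooks)
print(concept_id)
```

The list `concept_id` plays the role of an item's Concept ID: four integers, one per level, each in the range 0 to 2047.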

RecBase Model

Predict the next item's Concept ID sequence based on user history

Model or implementation: Decoder-only Transformer (Qwen2 architecture)

Novel Architectural Elements

Unified Tokenizer (CL-VAE): A curriculum-learning-based hierarchical quantizer that creates a shared discrete vocabulary for items across all domains, replacing domain-specific item IDs.
Pure Recommendation Pretraining: Unlike LLMs adapted for recommendation, this model is trained from scratch solely on recommendation interaction sequences (represented as Concept IDs).

Modeling

Base Model: Custom Transformer based on Qwen2 architecture (0.3B and 1.5B variants)

Training Method: Autoregressive Pretraining (Next Token Prediction)

Objective Functions:

Purpose: Minimize prediction error for the next token in the sequence.

Formally: Negative log-likelihood loss over concept ID tokens.

Purpose: Train the tokenizer (CL-VAE).

Formally: Reconstruction loss + codebook commitment loss + entropy loss (to encourage codebook usage).
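Written out, the two objectives take roughly the following form. The notation is an assumption introduced here for readability: x_1, ..., x_T denotes the flattened concept ID token sequence, θ the Transformer parameters, and λ_c, λ_e are hypothetical weights (the summary lists the three tokenizer terms without weights or exact definitions).

```latex
% Next-token objective over the flattened concept ID sequence x_1, ..., x_T
\mathcal{L}_{\mathrm{NTP}} = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)

% Tokenizer objective; \lambda_c and \lambda_e are hypothetical weights
\mathcal{L}_{\mathrm{tok}} = \mathcal{L}_{\mathrm{recon}}
  + \lambda_c \, \mathcal{L}_{\mathrm{commit}}
  + \lambda_e \, \mathcal{L}_{\mathrm{entropy}}
```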

Training Data:

Pretraining corpus: 15 domains, 4.5M items, 35M interactions
Data source: RecBench (diverse public datasets like Amazon, Google Local, etc.)

Key Hyperparameters:

vocab_size: 20,000
codebook_levels: 4
codebook_size_per_level: 2048
hidden_size_base: 1024
layers_base: 24
attention_heads_base: 16
hidden_size_large: 1536
layers_large: 28
attention_heads_large: 12
max_position_embedding_base: 32,768
max_position_embedding_large: 131,072
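The listed values can be collected into a small configuration object, which also makes the implied per-head dimension easy to check. The class name `RecBaseConfig` and its field names are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecBaseConfig:
    """Hypothetical container for the hyperparameters listed in the summary."""
    hidden_size: int
    num_layers: int
    num_attention_heads: int
    max_position_embeddings: int
    vocab_size: int = 20_000

    @property
    def head_dim(self):
        # Multi-head attention splits the hidden size evenly across heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads


BASE = RecBaseConfig(hidden_size=1024, num_layers=24,
                     num_attention_heads=16, max_position_embeddings=32_768)
LARGE = RecBaseConfig(hidden_size=1536, num_layers=28,
                      num_attention_heads=12, max_position_embeddings=131_072)

print(BASE.head_dim, LARGE.head_dim)
```

The base settings imply 64-dimensional heads and the large settings 128-dimensional heads.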

Compute: Not reported in the paper

Comparison to Prior Work

vs. P5/RecGPT: RecBase uses discrete Concept IDs derived from embeddings rather than natural language tokens to represent items, reducing sequence length and vocabulary mismatch.
vs. ID-based Sequential Rec (e.g., SASRec): RecBase is domain-agnostic and zero-shot capable due to the unified tokenizer, whereas SASRec is strictly single-domain [not cited in paper as direct zero-shot baseline, but standard in field].
vs. Tiger/PixelRec [not cited in paper]: Similar use of RQ-VAE for ID generation, but RecBase introduces Curriculum Learning (CL-VAE) to improve codebook utilization and cross-domain transfer.

Limitations

Dependency on quality of textual descriptions; poor item metadata may degrade embeddings and concept IDs.
Items in a new domain must be tokenized before inference: each item's text has to pass through the text encoder and the CL-VAE quantizer first.
No reported computational cost (GPU hours) for the pretraining phase.
Comparison baselines for zero-shot are mostly LLMs; lack of comparison to other foundational ID-based models (if any exist).

Reproducibility

Code: https://github.com/reczoo/RecBase

Code and models are promised at https://github.com/reczoo/RecBase. The pretraining data is compiled from public datasets (RecBench), and the NV-Embed-v2 text encoder is publicly available.

📊 Experiments & Results

Evaluation Setup

Zero-shot recommendation ranking on 8 unseen datasets. The model ranks candidate items based on the probability of generating their Concept IDs given the user history.
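The scoring rule described above can be sketched as follows: each candidate is scored by the summed log-probability of its concept ID tokens, conditioned on the history and on the candidate's own earlier tokens. The function `toy_log_probs` is a hypothetical stand-in for the pretrained Transformer, and the tiny vocabulary is for illustration only.

```python
import numpy as np


def sequence_log_prob(history_tokens, candidate_tokens, next_token_log_probs):
    """Sum log p(token | prefix) over the candidate's concept ID tokens."""
    prefix = list(history_tokens)
    total = 0.0
    for token in candidate_tokens:
        log_probs = next_token_log_probs(prefix)  # array of shape (vocab_size,)
        total += float(log_probs[token])
        prefix.append(token)  # condition later tokens on earlier ones
    return total


def rank_candidates(history_tokens, candidates, next_token_log_probs):
    """Return candidate names sorted by score (best first) plus the raw scores."""
    scores = {name: sequence_log_prob(history_tokens, toks, next_token_log_probs)
              for name, toks in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


def toy_log_probs(prefix, vocab_size=8):
    # Hypothetical stand-in for the model: favors repeating the last token.
    logits = np.zeros(vocab_size)
    if prefix:
        logits[prefix[-1]] = 2.0
    return logits - np.log(np.sum(np.exp(logits)))  # log-softmax


history = [1, 3, 3]
candidates = {"item_a": [3, 3], "item_b": [5, 6]}
ranking, scores = rank_candidates(history, candidates, toy_log_probs)
print(ranking)
```

Because every candidate has the same number of concept ID tokens, the summed log-probabilities are directly comparable without length normalization.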

Benchmarks:

RecBench (unseen subset) (Sequential Recommendation / Ranking)

Metrics:

AUC (Area Under Curve)
Statistical methodology: Not explicitly reported in the paper
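For reference, AUC can be computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses the straightforward pairwise definition with toy labels and scores; it is not the paper's evaluation code.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0, 0]              # toy click labels
scores = [0.9, 0.4, 0.35, 0.2, 0.8]   # toy model scores
example_auc = auc(labels, scores)
print(round(example_auc, 4))
```

By construction the value lies between 0 and 1, with 0.5 corresponding to random ranking.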

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Zero-shot performance comparing RecBase-1.5B against general-purpose LLMs (Llama-3, Qwen-2) and rec-specific LLMs (P5, RecGPT).
H&M	AUC	0.6287	0.6761	+0.0474
Steam	AUC	0.8102	0.8343	+0.0241
Average (8 datasets)	AUC	0.5843	0.6063	+0.0220
Effect of Fine-tuning on specific domains.
Steam	AUC	0.8343	1.0066	+0.1723
Ablation of CL-VAE components.
Average (All)	AUC	0.5891	0.6063	+0.0172

Experiment Figures

Visualization of the latent space distribution comparing standard RQ-VAE vs. CL-VAE.

Codebook usage frequency distribution.
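A usage distribution like the one in this figure is commonly summarized with two numbers: the fraction of codes that are ever selected, and the perplexity of the usage distribution. The sketch below computes both for toy assignments; the helper name `codebook_usage` and the example data are hypothetical.

```python
import math
from collections import Counter


def codebook_usage(codes, codebook_size):
    """Return (utilization, perplexity) for a list of assigned code indices."""
    total = len(codes)
    if total == 0:
        raise ValueError("need at least one code assignment")
    counts = Counter(codes)
    utilization = len(counts) / codebook_size  # fraction of codes used at least once
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    perplexity = math.exp(entropy)  # effective number of codes in use
    return utilization, perplexity


collapsed = [0] * 90 + [1] * 10   # nearly everything maps to one code
balanced = list(range(100))       # every code used exactly once

util_c, ppl_c = codebook_usage(collapsed, codebook_size=100)
util_b, ppl_b = codebook_usage(balanced, codebook_size=100)
print(util_c, round(ppl_c, 3), util_b, round(ppl_b, 3))
```

A collapsed codebook shows low utilization and a perplexity close to 1, while a well-used codebook approaches a perplexity equal to the codebook size.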

Main Takeaways

RecBase matches or exceeds 7B- to 8B-parameter LLMs (Qwen-2-7B, Llama-3-8B) with only 1.5B parameters, supporting the efficiency of recommendation-specific pretraining.
The CL-VAE tokenizer successfully maps diverse domains into a shared latent space, enabling zero-shot transfer where ID-based models traditionally fail.
Fine-tuning significantly boosts performance, but the model provides a strong starting point (foundation) even without it.
Inference is cheaper than with text-based LLMs because the vocabulary is compact (20k vs. >100k tokens) and sequences are shorter (concept IDs instead of verbose text).

📚 Prerequisite Knowledge

Prerequisites

Vector Quantized Variational Autoencoders (VQ-VAE / RQ-VAE)
Autoregressive Transformer architectures
Sequential Recommendation
Curriculum Learning

Key Terms

RQ-VAE: Residual Quantized Variational Autoencoder—a method to compress high-dimensional vectors into discrete codes by recursively quantizing residuals

CL-VAE: Curriculum Learning Enhanced RQ-VAE—the paper's proposed tokenizer that trains hierarchical codebooks stage-by-stage to prevent collapse

Concept ID: A sequence of discrete tokens (IDs) representing an item, generated by the VAE, used as the vocabulary for the recommendation model

Codebook Collapse: A failure mode in VQ-VAE where only a small subset of discrete codes is used, reducing representational capacity

Zero-shot recommendation: Making recommendations on a dataset/domain the model was not trained on, without any parameter updates