RecGPT: A Foundation Model for Sequential Recommendation

📝 Paper Summary

Generative recommendation · Sequential recommendation · Foundation models for recommendation

RecGPT is a text-driven foundation model that replaces item IDs with quantized semantic tokens and uses a hybrid attention mechanism to enable zero-shot sequential recommendation across new domains without retraining.

Core Problem

Traditional sequential recommenders rely on specific item IDs, making them unable to generalize to new domains or items (cold-start) without extensive retraining.

Why it matters:

Recommender systems fail in data-sparse environments or when introducing new product lines because ID embeddings lack semantic transferability.
Current approaches require resource-intensive retraining cycles whenever the item catalog changes significantly.
Existing ID-based methods cannot effectively handle the 'cold-start' problem where new items lack interaction history.

Concrete Example: A system trained on Amazon 'Baby' products cannot recommend 'Games' because the item IDs (e.g., item_386) are disjoint. RecGPT processes the text description 'basketball' directly, allowing it to recommend sports items even if it was only trained on baby products.

Key Novelty

Text-Driven Foundation Model with Finite Scalar Quantization (RecGPT)

Derives item representations exclusively from text using an encoder and Finite Scalar Quantization (FSQ) to create a domain-invariant discrete token space, eliminating the need for domain-specific item IDs.
Employs a hybrid attention mechanism that is bidirectional within an item's token sequence (to maintain semantic coherence) but causal between items (to model sequential user history).
Integrates auxiliary continuous semantic embeddings alongside discrete tokens to prevent information loss typically associated with quantization.

Architecture

The complete RecGPT architecture including tokenization, modeling, and decoding.

Evaluation Highlights

Achieves significantly higher zero-shot Hit@5 on the 'Baby' dataset (0.0283) compared to few-shot baselines like BERT4Rec (0.0099) that had access to 10% target data.
Outperforms state-of-the-art methods in cold-start scenarios on the 'Office' dataset, reaching a Hit@5 of 0.0204 vs. 0.0186 for the strongest baseline (DuoRec), which was trained on in-domain data.
Demonstrates power-law scaling properties similar to LLMs, where zero-shot performance consistently improves as pre-training data volume increases from 5% to 100%.

Breakthrough Assessment

9/10

Directly tackles the long-standing problem of ID dependency in recommenders. By applying FSQ and LLM-style generation to recommendation, it reports genuine zero-shot transfer, a major step beyond ID-based transfer learning.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation formulated as autoregressive next-token prediction

Inputs: Sequence of user interactions X = (x_1, ..., x_n) where each x is a text description

Outputs: The next item x_{n+1} (represented as a sequence of discrete tokens)
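To make the input/output framing concrete, here is a minimal sketch of how a tokenized interaction history might be split into a model input and a prediction target. The function name `make_training_pair` and the toy token ids are illustrative assumptions, not taken from the paper.

```python
def make_training_pair(item_token_seqs):
    """Split a user's tokenized history into (input tokens, target tokens).

    item_token_seqs: list of per-item token tuples, oldest item first.
    The last item is held out as the next item to predict.
    """
    if len(item_token_seqs) < 2:
        raise ValueError("need at least one history item and one target item")
    # Flatten all items except the last into one token stream.
    history = [tok for item in item_token_seqs[:-1] for tok in item]
    # The held-out item is predicted as a short sequence of discrete tokens.
    target = list(item_token_seqs[-1])
    return history, target


history, target = make_training_pair([(1, 2), (3, 4), (5, 6)])
print(history, target)
```

Here each item is assumed to contribute the same number of tokens, so the flattened history has a regular block structure that the later modules can rely on.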

Pipeline Flow

Text Encoder (MPNet) → Embedding Quantization (FSQ) → Sequence Modeling (Transformer) → Item Decoding (Beam Search)

System Modules

Text Encoder

Convert item text (title, category) into continuous semantic embeddings

Model or implementation: MPNet

Unified Item Tokenizer

Transform continuous embeddings into discrete tokens for autoregressive modeling

Model or implementation: Finite Scalar Quantization (FSQ)
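The quantization step can be sketched as follows. This is a simplified forward pass of FSQ-style rounding under stated assumptions: each dimension uses an odd number of levels, the level counts are hypothetical (the summary does not report the codebook size), and the embedding is assumed to be already projected down to `len(levels)` dimensions. Training-time gradient handling (the straight-through estimator) is not shown.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each coordinate of z onto a small fixed grid (FSQ-style).

    z      : list of floats, one per quantized dimension
    levels : list of odd ints, grid points per dimension (hypothetical values)
    Returns (codes, token_id): per-dimension integer codes and one flat index.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2         # grid spans -half .. +half
        bounded = math.tanh(value) * half  # squash into (-half, +half)
        codes.append(round(bounded) + half)  # shift to 0 .. n_levels - 1
    # Mixed-radix encoding turns the code vector into a single token id.
    token_id = 0
    for code, n_levels in zip(codes, levels):
        token_id = token_id * n_levels + code
    return codes, token_id


print(fsq_quantize([0.0, 0.0], [5, 5]))  # ([2, 2], 12)
```

Because the grid is fixed rather than learned, every combination of codes is a valid token, which is why this family of quantizers has no codebook to collapse.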

Universal Recommendation Model

Predict next tokens based on history, capturing item and sequence dependencies

Model or implementation: Transformer Decoder with Hybrid Attention
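The hybrid attention pattern can be illustrated with a small mask builder. This is a sketch of the mask shape only, assuming every item occupies exactly `tokens_per_item` consecutive positions; how the paper combines this mask with its training targets is not covered in this summary.

```python
def hybrid_attention_mask(num_items, tokens_per_item):
    """Return mask[q][k] = True when query position q may attend to key position k.

    Tokens of the same item see each other (bidirectional); tokens of earlier
    items are visible (causal); tokens of later items are blocked.
    """
    n = num_items * tokens_per_item
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_item = q // tokens_per_item
        for k in range(n):
            k_item = k // tokens_per_item
            mask[q][k] = k_item <= q_item
    return mask


for row in hybrid_attention_mask(num_items=2, tokens_per_item=2):
    print(["x" if allowed else "." for allowed in row])
```

The result is block lower-triangular: a standard causal mask at the level of items, with each diagonal block fully filled in.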

Efficient Item Token Decoder

Map predicted tokens back to valid catalog items

Model or implementation: Catalog-aware Beam Search
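A minimal sketch of catalog-constrained decoding: the catalog's token sequences are stored in a trie, and beam search only expands tokens that continue a valid prefix. The helper names, the `log_prob` callback (a stand-in for the model's next-token log-probability), and the toy catalog are all illustrative assumptions. The sketch also assumes every item has the same number of tokens and a unique token sequence.

```python
import math


def build_trie(catalog):
    """catalog: dict mapping item name -> tuple of int token ids."""
    root = {}
    for item, tokens in catalog.items():
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["item"] = item  # leaf marker; token ids are ints, so no key clash
    return root


def constrained_beam_search(trie, log_prob, length, beam_width):
    """Keep the beam_width best prefixes, expanding only tokens in the trie."""
    beams = [(0.0, (), trie)]  # (cumulative score, token prefix, trie node)
    for _ in range(length):
        candidates = []
        for score, prefix, node in beams:
            for tok, child in node.items():
                if tok == "item":
                    continue
                candidates.append((score + log_prob(prefix, tok), prefix + (tok,), child))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return [(node["item"], score) for score, _, node in beams if "item" in node]


# Toy usage: a context-free scorer standing in for the real model.
catalog = {"ball": (1, 2), "crib": (1, 3), "desk": (4, 2)}
token_logp = {1: math.log(0.6), 4: math.log(0.4), 2: math.log(0.7), 3: math.log(0.3)}
results = constrained_beam_search(build_trie(catalog), lambda prefix, tok: token_logp[tok], 2, 2)
print([name for name, _ in results])  # ['ball', 'desk']
```

Restricting expansion to trie children guarantees that every finished beam corresponds to a real catalog item, so no post-hoc matching of invalid token sequences is needed.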

Novel Architectural Elements

Hybrid bidirectional-causal attention: Bidirectional for tokens within one item (intra-item), causal for tokens between different items (inter-item).
Dual-stream embedding injection: Concatenates discrete token embeddings (E_wte) with continuous auxiliary embeddings (E_aux) before the transformer layers to mitigate quantization loss.
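The dual-stream injection amounts to a position-wise concatenation of two vectors. The sketch below shows only that step, with plain lists standing in for E_wte and E_aux; how the auxiliary vectors are aligned to token positions is an assumption here, since the summary does not specify it.

```python
def fuse_embeddings(token_embeddings, aux_embeddings):
    """Concatenate discrete-token and auxiliary vectors position by position.

    token_embeddings: list of vectors looked up from the token table (E_wte)
    aux_embeddings  : list of continuous semantic vectors (E_aux), assumed to be
                      aligned one-to-one with the token positions
    """
    if len(token_embeddings) != len(aux_embeddings):
        raise ValueError("both streams must cover the same positions")
    return [list(tok) + list(aux) for tok, aux in zip(token_embeddings, aux_embeddings)]


fused = fuse_embeddings([[1.0, 2.0], [3.0, 4.0]], [[0.5], [0.25]])
print(fused)  # [[1.0, 2.0, 0.5], [3.0, 4.0, 0.25]]
```

The fused width is the sum of the two input widths, so the continuous stream can carry detail that rounding to discrete codes discards.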

Modeling

Base Model: Transformer (Decoder-only architecture)

Training Method: Autoregressive Next-Token Prediction

Objective Functions:

Purpose: Minimize reconstruction error of embeddings during quantization.

Formally: L_fsq = ||e_i - Decoder(Quantized(e_i))||_1

Purpose: Maximize likelihood of next token in sequence.

Formally: L_ar = - sum_t log P(Y_t | X_{<t})
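As a worked illustration of the two objectives, the following sketch evaluates each on toy numbers. The function names are illustrative, and the inputs stand in for real model outputs: a reconstructed embedding for L_fsq, and the probabilities the model assigned to the correct next tokens for L_ar.

```python
import math


def l1_reconstruction_loss(original, reconstructed):
    """L1 distance between an embedding and its reconstruction after quantization."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed))


def autoregressive_loss(target_probs):
    """Negative log-likelihood summed over positions.

    target_probs: probability assigned to the correct next token at each step.
    """
    return -sum(math.log(p) for p in target_probs)


print(l1_reconstruction_loss([1.0, -2.0], [0.5, -1.0]))  # 1.5
print(autoregressive_loss([0.5, 0.25]))
```

With target probabilities 0.5 and 0.25 the second loss equals log(8), since -log(0.5) - log(0.25) = log(2) + log(4).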

Training Data:

Pre-trained on large-scale datasets across multiple domains (Amazon, Yelp, etc.)
Evaluated on 6 datasets

Key Hyperparameters:

codebook_size: Not explicitly reported in the paper
tokens_per_item_K: Not explicitly reported in the paper
embedding_dimension: Not explicitly reported in the paper

Comparison to Prior Work

vs. S3-Rec/BERT4Rec: RecGPT is text-driven and zero-shot, whereas baselines require ID embeddings and in-domain training.
vs. VQ-Rec: RecGPT uses Finite Scalar Quantization (FSQ) and a decoder-only architecture, whereas VQ-Rec typically uses standard VQ-VAE and may rely on different reconstruction objectives.
vs. P5 [not cited in paper]: P5 formulates recommendation as distinct prompts/tasks for an LLM; RecGPT creates a dedicated tokenized item space specifically for efficient autoregressive recommendation.

Limitations

Dependency on the quality of textual descriptions; poor text data may degrade embeddings.
Inference cost of beam search decoding is higher than simple dot-product retrieval of ID-based methods.
The paper does not explicitly detail the computational cost (training time/resources) compared to lightweight ID-based models.

Reproducibility

Code: https://github.com/HKUDS/RecGPT

Code is publicly available at https://github.com/HKUDS/RecGPT. The paper details the datasets (Amazon, Yelp, Steam, etc.) and baselines. Hyperparameters like specific learning rates or exact transformer dimensions are not explicitly listed in the main text.

📊 Experiments & Results

Evaluation Setup

Cross-domain Zero-shot (train on source, test on target without retraining) and Cold-start (truncated history).

Benchmarks:

Amazon Reviews (Baby, Games, Office) (Sequential Recommendation)
Yelp (Point-of-Interest Recommendation)
Steam (Game Recommendation)
Industrial Dataset (News Recommendation)

Metrics:

Hit@1
Hit@5
NDCG@5
Statistical methodology: Statistical significance (p < 0.05) indicated by *

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Zero-shot performance of RecGPT (no training on target) vs. Baselines trained on 10% of target data (Few-shot).
Baby	Hit@5	0.0099	0.0283	+0.0184
Baby	NDCG@5	0.0061	0.0279	+0.0218
Yelp	Hit@5	0.0083	0.0166	+0.0083
Cold-start performance where user history is truncated to 1-3 items.
Office	Hit@5	0.0186	0.0204	+0.0018
Office	NDCG@5	0.0137	0.0197	+0.0060
Ablation studies validating architectural choices.
Baby	Hit@5	0.0178	0.0283	+0.0105
Baby	Hit@5	0.0279	0.0283	+0.0004

Experiment Figures

Comparison of Zero-Shot RecGPT vs Few-Shot Baselines on Baby and Yelp datasets.

Main Takeaways

RecGPT consistently outperforms baselines in zero-shot settings, even beating models trained on 10-50% of the target domain data.
The model exhibits scaling behavior similar to LLMs: zero-shot performance improves in an approximately power-law fashion as the amount of pre-training data grows.
Finite Scalar Quantization (FSQ) is critical; replacing it with random tokens causes drastic performance drops, indicating that the tokenization carries semantic value.
Cold-start performance is robust, indicating that the model learns generalizable sequential patterns that apply even with minimal user history.

📚 Prerequisite Knowledge

Prerequisites

Transformer architectures (Attention mechanisms)
Vector Quantization (VQ) concepts
Sequential Recommendation basics

Key Terms

FSQ: Finite Scalar Quantization—a method to map continuous embeddings to discrete tokens by projecting them into a hypercube and rounding, avoiding codebook collapse.

MPNet: A pre-trained language model combining Masked Language Modeling and Permuted Language Modeling, used here to encode item text.

STE: Straight-Through Estimator—a technique allowing gradients to bypass non-differentiable rounding functions during backpropagation.

beam search: A search algorithm that explores a graph by expanding the most promising nodes in a limited set, used here to decode tokens into items.

Trie: A tree data structure used to store the item catalog's token sequences, enabling efficient prefix-constrained decoding.

cold-start: A scenario where the system must recommend new items, or recommend to new users, with little to no historical interaction data.

zero-shot: Evaluating a model on a dataset or domain it has never seen during training.

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items in the recommendation list.

Hit@K: A metric measuring the proportion of times the ground-truth item appears in the top-K recommendations.
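The two ranking metrics can be computed per test case as sketched below, assuming a single ground-truth next item per user (the usual next-item setup; the paper's exact evaluation protocol is not reproduced here). Reported scores would be the average of these values over all test cases.

```python
import math


def hit_at_k(ranked_items, target, k):
    """1.0 if the ground-truth item is among the top-k recommendations, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with one relevant item: 1 / log2(rank + 1) if ranked in the top k.

    With a single relevant item the ideal DCG is 1, so no extra normalization
    is needed.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position
    return 1.0 / math.log2(rank + 1)


ranking = ["a", "b", "c"]
print(hit_at_k(ranking, "b", 2), ndcg_at_k(ranking, "b", 2))
```

NDCG@K can never exceed Hit@K under this setup, and the two coincide only when every hit lands at rank 1.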