PLUM: Adapting Pre-trained Language Models for Industrial-scale Generative Recommendations

📝 Paper Summary

Generative Recommendation · LLM Adaptation for Recommendation · Industrial Recommender Systems

PLUM replaces massive embedding tables with compact Semantic IDs and adapts pre-trained LLMs via continued pre-training on user behavior data to perform generative retrieval at YouTube scale.

Core Problem

Standard Large Embedding Models (LEMs) rely on massive embedding tables that hit scaling bottlenecks and prevent using deeper networks, while off-the-shelf LLMs lack domain-specific knowledge of user behavior and item catalogs.

Why it matters:

LEMs waste parameters on memorizing IDs rather than learning complex reasoning, limiting improvements from scaling neural networks
Directly applying LLMs to recommendation fails because they don't understand user preferences or the specific item corpus (domain gap)
Large embedding tables require massive training data, making it computationally expensive to train large Transformer architectures

Concrete Example: A standard LLM might recommend a generic 'funny cat video' based on text, but fails to identify the specific video ID 'Vid123' that a user with a specific watch history (e.g., specific gaming channels) would actually click next.

Key Novelty

End-to-end framework aligning LLMs with discrete item tokens (SIDs) via Continued Pre-Training

Replaces random item IDs with Semantic IDs (SIDs)—hierarchical discrete codes derived from multi-modal content and user co-occurrence patterns—allowing items to be processed as language tokens
Bridges the domain gap by continuing to pre-train (CPT) the LLM on a mixture of user history sequences and video metadata, teaching it to predict SIDs from context before fine-tuning
Shifts model complexity from massive memory-based embedding tables (LEMs) to compute-based deep neural networks, enabling better scaling with model size

Architecture

The SID-v2 training pipeline (RQ-VAE based).

Evaluation Highlights

PLUM achieves substantially better sample efficiency than a heavily-optimized production LEM, matching performance with significantly fewer training examples
Retrieval performance scales effectively with model size, continuing to improve up to a Mixture-of-Experts (MoE) model with over 900M activated parameters
Hallucination rate (generating invalid SIDs) drops to < 5% after supervised fine-tuning

Breakthrough Assessment

9/10

Successfully deploys generative retrieval at YouTube scale (billions of users/items), proving LLMs can replace massive embedding tables in industrial settings. A major architectural shift from LEMs.

⚙️ Technical Details

Problem Definition

Setting: Generative retrieval where a model autoregressively generates the Semantic IDs of the next item a user will engage with

Inputs: User context sequence including watch history (SIDs), numerical features, and other text features

Outputs: Sequence of discrete tokens representing the Semantic ID of the recommended item

Pipeline Flow

Item Tokenization (SID-v2 Generation)
Continued Pre-Training (CPT)
Supervised Fine-Tuning (SFT)
Inference (Beam Search Generation)
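
The inference stage can be illustrated with a minimal beam search over SID tokens. This is a sketch under stated assumptions, not the paper's serving code: `next_log_probs` is a hypothetical stand-in for the fine-tuned LLM's next-token distribution, and the toy model below ignores the prefix entirely.

```python
import math

def beam_search(next_log_probs, sid_length, beam_width):
    """Return the beam_width highest-scoring SID sequences of length sid_length.

    next_log_probs(prefix) is a hypothetical callable that returns one
    log-probability per candidate token for the next SID level.
    """
    beams = [((), 0.0)]  # (SID prefix, cumulative log-probability)
    for _ in range(sid_length):
        candidates = []
        for prefix, score in beams:
            for token, log_p in enumerate(next_log_probs(prefix)):
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the best beam_width partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_model(prefix):
    # Assumption: a fixed 3-token distribution at every level, ignoring prefix.
    return [math.log(p) for p in (0.6, 0.3, 0.1)]

top_beams = beam_search(toy_model, sid_length=2, beam_width=2)
best_sid, best_score = top_beams[0]
```

Each returned SID still has to be mapped back to a video. A generated SID that maps to no video is what the summary counts as a hallucination.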

System Modules

SID-v2 Quantizer

Convert item content (video/audio/text) into discrete hierarchical tokens

Model or implementation: Enhanced RQ-VAE with multi-modal fusion
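
To make the quantization step concrete, the sketch below shows plain residual quantization with codebooks that shrink at deeper levels. It covers only the code-assignment step, assuming the codebooks are already trained. The encoder, decoder, multi-modal fusion, and training losses of the actual SID-v2 model are omitted. The codebook sizes, dimensions, and random values are toy assumptions.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes the previous residual."""
    sid, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return sid, residual

rng = random.Random(0)
dim = 4
sizes = [8, 4, 2]  # toy stand-in for a 2048 -> 1024 -> ... schedule
codebooks = [
    [[rng.gauss(0, 1.0 / (level + 1)) for _ in range(dim)] for _ in range(size)]
    for level, size in enumerate(sizes)
]
item_embedding = [rng.gauss(0, 1) for _ in range(dim)]
sid, residual = residual_quantize(item_embedding, codebooks)
```

The resulting `sid` is a list with one index per level, which is the form in which an item would be presented to the language model as a short token sequence.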

Generative Recommender

Predict the next item's SID sequence based on user history

Model or implementation: Decoder-only Transformer (LLM)

Novel Architectural Elements

Replacement of massive embedding tables with compact Semantic IDs (SIDs) processed by LLMs
Multi-resolution codebook structure for SIDs (decreasing cardinality at deeper levels: 2048 -> 1024 -> ...)
Injection of collaborative signals (co-occurrence) directly into the quantization (SID generation) process
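
One common way to inject co-occurrence signals is an InfoNCE-style contrastive loss, sketched below. The summary only says the loss is contrastive, so the InfoNCE form, the dot-product similarity, and the temperature value are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: low when anchor is closer to positive than to negatives.

    anchor and positive would be representations of two co-watched items;
    negatives are representations of items not watched together with anchor.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[0]

anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(anchor, [1.0, 0.0], negatives)    # co-watched item is close
loss_opposed = info_nce(anchor, [-1.0, 0.0], negatives)   # co-watched item is far
```

Minimizing this term pulls co-watched items together before quantization, so items with similar audiences tend to share SID prefixes.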

Modeling

Base Model: Pre-trained Decoder-only Transformer LLM (specific family not named, likely internal Google model)

Training Method: Continued Pre-Training (CPT) followed by Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Encode items into discrete SIDs capturing content and behavior.

Formally: L = L_recon + L_rq + L_con (Reconstruction + Quantization + Co-occurrence Contrastive)

Purpose: Align LLM with SIDs and domain data (CPT).

Formally: Next-token prediction on a mixture of user behavior sequences and video metadata

Purpose: Optimize for recommendation clicks (SFT).

Formally: Autoregressive maximum-likelihood on clicked video SIDs, weighted by reward r(user, v_click)
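
The SFT objective can be sketched as a reward-weighted negative log-likelihood over the tokens of the clicked video's SID. The function names and the toy logits are hypothetical. The exact definition of the reward r(user, v_click) is not given in this summary, so it is passed in as a plain number.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one step's logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def reward_weighted_nll(step_logits, target_sid, reward):
    """Compute -reward * sum_t log p(sid_t | context, sid_<t).

    step_logits[t] holds the model's logits at SID level t, already
    conditioned on the user context and the earlier SID tokens.
    """
    log_prob = 0.0
    for logits, token in zip(step_logits, target_sid):
        log_prob += log_softmax(logits)[token]
    return -reward * log_prob

# Toy case: uniform logits over 4 tokens at each of 2 SID levels.
uniform_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = reward_weighted_nll(uniform_logits, target_sid=[2, 1], reward=2.0)
```

With uniform predictions, each level contributes log 4 to the negative log-likelihood, so the loss here is 2.0 × 2 × log 4.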

Training Data:

CPT Data: 50% User behavior data (watch histories), 50% Video metadata corpus (SID, title, description, etc.)
SFT Data: User context and history pairs with ground-truth clicked videos, sampled based on reward
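
The 50/50 CPT mixture can be illustrated with a simple sampler that picks a source per example. This is only a sketch: the example strings and the `<sid_...>` token format are invented for illustration, and the paper may mix data at a different granularity (for instance per batch or per token).

```python
import random

def mixture_stream(behavior_examples, metadata_examples, behavior_fraction=0.5, seed=0):
    """Yield training examples, choosing the source with the given probability."""
    rng = random.Random(seed)
    while True:
        if rng.random() < behavior_fraction:
            yield rng.choice(behavior_examples)
        else:
            yield rng.choice(metadata_examples)

# Hypothetical examples; the real token format is not given in this summary.
behavior = [("behavior", "<sid_12><sid_7> <sid_3><sid_9>")]
metadata = [("metadata", "<sid_12><sid_7> title: example video")]

stream = mixture_stream(behavior, metadata)
draws = [next(stream) for _ in range(1000)]
behavior_share = sum(1 for source, _ in draws if source == "behavior") / len(draws)
```

Over many draws `behavior_share` settles near 0.5, matching the stated split between watch histories and video metadata.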

Key Hyperparameters:

cpt_steps: 1 million
cpt_batch_size: 16
cpt_total_tokens: approx 260 billion
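
As a sanity check on how these three figures relate, the arithmetic below derives the implied tokens per step. It assumes the step count and token total describe the same run. The summary does not say whether the batch size of 16 is global or per device, so the per-sequence figure is only meaningful under the global reading.

```python
cpt_steps = 1_000_000
cpt_total_tokens = 260_000_000_000  # "approx 260 billion"
cpt_batch_size = 16

tokens_per_step = cpt_total_tokens / cpt_steps
# Only valid if 16 is the global number of sequences per step (an assumption).
tokens_per_sequence = tokens_per_step / cpt_batch_size
```

This gives roughly 260,000 tokens per step and, under the global-batch assumption, about 16,250 tokens per sequence. A per-device batch size with many devices would imply much shorter sequences.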

Compute: Not reported in the paper

Comparison to Prior Work

vs. TIGER: PLUM adds Multi-Resolution Codebooks, Progressive Masking, and Co-occurrence Contrastive Loss to the SID generation
vs. LEMs: PLUM removes massive embedding tables, shifting parameters to the dense network
vs. MMQ: PLUM concatenates embeddings before encoding rather than generating separate tokens per modality
vs. Uncited approaches: PLUM uniquely emphasizes the 'Continued Pre-Training' stage to bridge the LLM domain gap, which many scratch-trained generative recommenders skip

Limitations

Inference latency and cost for autoregressive generation are likely higher than dot-product retrieval (though not explicitly quantified)
Requires sophisticated serving infrastructure to handle beam search at billion-user scale
Hallucination of invalid SIDs is possible (though reduced to <5% via SFT)
Relies on expensive pre-training and CPT stages compared to standard supervised learning

Reproducibility

No replication artifacts mentioned in the paper. The system is trained on internal Google/YouTube datasets and uses internal infrastructure. Code and model weights are not provided.

📊 Experiments & Results

Evaluation Setup

Large-scale internal video recommendation (YouTube)

Benchmarks:

Internal YouTube Video Recommendation Datasets (Next-item prediction / Retrieval) [New]

Metrics:

Retrieval Performance (Softmax Loss / Recall proxy)
Sample Efficiency
Hallucination Rate
SID-to-Video Collision Rate

Statistical methodology: Not explicitly reported in the paper
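
Two of the metrics above are simple ratios and can be sketched directly. The definitions used here are assumptions, since the summary does not spell them out: hallucination rate is taken as the share of generated SIDs that match no video in the corpus, and collision rate as the share of videos whose SID is shared with at least one other video.

```python
from collections import Counter

def hallucination_rate(generated_sids, valid_sids):
    """Fraction of generated SIDs that do not correspond to any known video."""
    if not generated_sids:
        return 0.0
    invalid = sum(1 for sid in generated_sids if tuple(sid) not in valid_sids)
    return invalid / len(generated_sids)

def collision_rate(video_to_sid):
    """Fraction of videos that share their SID with at least one other video."""
    counts = Counter(tuple(sid) for sid in video_to_sid.values())
    collided = sum(1 for sid in video_to_sid.values() if counts[tuple(sid)] > 1)
    return collided / len(video_to_sid)

# Toy data with hypothetical video names and 2-level SIDs.
video_to_sid = {"video_a": (1, 2), "video_b": (1, 2), "video_c": (3, 4)}
valid_sids = set(video_to_sid.values())
generated = [(1, 2), (3, 4), (9, 9)]

h_rate = hallucination_rate(generated, valid_sids)
c_rate = collision_rate(video_to_sid)
```

In this toy data one of three generated SIDs is invalid and two of three videos collide.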

Key Results

Scaling studies show that PLUM's retrieval performance improves predictably with model size and training duration.

Benchmark	Metric	Baseline	This Paper	Δ
Internal YouTube Dataset	Retrieval Performance	Not reported in the paper	Not reported in the paper	-
Internal YouTube Dataset	Training Steps to Convergence	Not reported in the paper	Not reported in the paper	-
Internal YouTube Dataset	Hallucination Rate (Invalid SIDs)	Not reported in the paper	< 5%	-

Experiment Figures

The Generative Retrieval Model Architecture and Input Prompt.

Main Takeaways

Replacing embedding tables with dense LLM parameters improves sample efficiency significantly, allowing models to learn faster from fewer examples.
Continued Pre-Training (CPT) is critical for bridging the gap between pre-trained LLM knowledge and domain-specific recommendation tasks.
The proposed SID-v2 (with co-occurrence loss and multi-modal fusion) effectively captures both content semantics and collaborative signals.
The framework scales effectively to MoE architectures with billions of parameters, suggesting a path forward for scaling industrial recommenders beyond embedding table bottlenecks.

📚 Prerequisite Knowledge

Prerequisites

Deep Learning Recommendation Models (DLRM)
Vector Quantization (RQ-VAE)
Transformer architectures
Large Language Model pre-training and fine-tuning

Key Terms

Semantic IDs (SIDs): Discrete, hierarchical token sequences representing items, derived from content embeddings via quantization

LEM: Large Embedding Model—traditional recommendation architecture relying on massive embedding tables for categorical features (e.g., item IDs)

RQ-VAE: Residual-Quantized Variational AutoEncoder—a model used to compress dense embeddings into discrete codes (SIDs) by recursively quantizing residuals

CPT: Continued Pre-Training—an intermediate training stage where an LLM is trained on domain-specific data (user history, item metadata) to align SIDs with text

SFT: Supervised Fine-Tuning—the final training stage optimizing the model for the specific recommendation objective (predicting the next clicked video SID)

MoE: Mixture-of-Experts—a neural network architecture where different sub-networks (experts) are activated for different inputs, allowing immense scaling

Generative Retrieval: A paradigm where the model directly generates item identifiers (like SIDs) rather than selecting them via dot-product similarity search

Co-occurrence contrastive loss: A loss function used during SID training that pulls representations of items watched together closer, injecting collaborative filtering signals

Progressive Masking: A technique in SID training that randomly masks deeper codebook levels to enforce hierarchical structure and robustness
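
Progressive masking can be pictured with the toy sketch below, which keeps a random-length prefix of an SID and masks every deeper level. The uniform choice of prefix length and the use of `None` as the mask value are assumptions. The summary does not describe the paper's actual masking schedule.

```python
import random

MASK = None  # hypothetical placeholder for a masked codebook level

def progressive_mask(sid, rng):
    """Keep a random-length prefix of the SID and mask every deeper level."""
    keep = rng.randint(1, len(sid))  # always keep at least the coarsest level
    return [token if level < keep else MASK for level, token in enumerate(sid)]

rng = random.Random(0)
full_sid = [17, 4, 9, 2]
masked_examples = [progressive_mask(full_sid, rng) for _ in range(5)]
```

Because only suffixes are masked, the coarse levels must carry useful information on their own, which is one way to encourage the hierarchical structure described above.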