Generating Long Semantic IDs in Parallel for Recommendation

📝 Paper Summary

Generative Recommendation, Semantic ID-based Recommendation

RPG replaces autoregressive generation with parallel prediction of long, unordered semantic IDs, using a multi-token prediction loss and graph-constrained decoding to improve efficiency and expressiveness.

Core Problem

Existing generative recommenders rely on autoregressive decoding (generating one token at a time) and beam search, which forces the use of short, less expressive semantic IDs to maintain acceptable latency.

Why it matters:

Current generative models suffer from high inference latency due to multiple forward passes (one per token) required for autoregressive generation.
To mitigate latency, current models restrict semantic IDs to very short sequences (e.g., 4 tokens), which limits the semantic expressiveness of item representations compared to retrieval-based methods.
Running standard beam search on expressive, long semantic IDs (e.g., 32+ tokens) is computationally prohibitive.

Concrete Example: In TIGER, generating a recommendation requires 4 sequential forward passes because it uses 4-token IDs. With a more expressive 32-token ID (like VQ-Rec), TIGER would need 32 sequential forward passes, each followed by a beam search step, which makes real-time inference impractical.

Key Novelty

Parallel Semantic ID Generation with Graph-Constrained Decoding

Treats the tokens of a semantic ID as having no sequential dependency on one another (each position is predicted on its own) rather than as a left-to-right sequence, allowing the model to predict all tokens of the next item simultaneously in a single forward pass.
Uses a graph-based decoding strategy during inference that connects semantically similar IDs, enabling efficient traversal of the candidate space without enumerating all items.

Architecture

The overall framework of RPG, illustrating the transition from Item Quantization to Parallel Generation and Graph Decoding.

Evaluation Highlights

Outperforms generative baselines by an average of 12.6% on NDCG@10 across benchmarks by scaling semantic ID length to 64 tokens.
Makes inference time complexity independent of the total number of items, unlike retrieval baselines.
Achieves O(1) sequence encoder forward passes per recommendation, compared to O(b*m) for autoregressive models like TIGER.

Breakthrough Assessment

8/10

Addresses the main bottleneck of generative recommendation (inference latency) while also improving performance through longer IDs. The shift from autoregressive to parallel generation for semantic IDs is a strong methodological pivot.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation: Given a user's interaction history sequence, predict the next item.

Inputs: Sequence of previously interacted items s = {i_1, i_2, ..., i_{t-1}}

Outputs: The next item i_t, represented as a semantic ID tuple

Pipeline Flow

Item Tokenization (OPQ) -> Semantic ID Construction
Sequence Encoding (Transformer) -> Sequence Representation
Parallel Token Prediction (Logit Cache) -> Candidate Scoring
Graph-Constrained Decoding -> Top-K Items
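
The "Logit Cache -> Candidate Scoring" step can be read as follows: once per-position log-probabilities are computed for a history, any candidate item is scored by looking up its tokens in that cached table. The sketch below assumes the score is the sum of the per-position log-probabilities (which matches the form of the MTP objective); the function and variable names are hypothetical, not taken from the paper's code.

```python
import numpy as np

def score_candidates(log_probs: np.ndarray, semantic_ids: np.ndarray) -> np.ndarray:
    """Score items by summing cached per-position log-probabilities.

    log_probs:    (m, M) array; row j holds log P^(j)(token | history).
    semantic_ids: (n, m) integer array; row i is the semantic ID of item i.
    Returns an (n,) array of scores (assumed scoring rule: sum over positions).
    """
    m = log_probs.shape[0]
    positions = np.arange(m)
    # Advanced indexing picks log_probs[j, semantic_ids[i, j]] for every i, j.
    return log_probs[positions, semantic_ids].sum(axis=1)

# Toy cache with m = 3 positions and M = 4 tokens per codebook.
toy_log_probs = np.log(np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]))
toy_ids = np.array([[0, 1, 2], [0, 1, 3], [3, 3, 3]])
toy_scores = score_candidates(toy_log_probs, toy_ids)
best_item = int(np.argmax(toy_scores))
```

Because the table is computed once per history, scoring a candidate costs m lookups and no additional encoder pass.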

System Modules

Item Tokenizer (Input Processing)

Converts item features into discrete semantic IDs

Model or implementation: Optimized Product Quantization (OPQ)
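
To make the tokenization step concrete, here is a minimal sketch of plain product quantization in NumPy. It is a simplification: OPQ additionally optimizes a rotation of the feature space before quantizing, which is omitted here. The k-means helper, the feature sizes, and all names are illustrative assumptions.

```python
import numpy as np

def kmeans(x: np.ndarray, n_clusters: int, n_iters: int = 10, seed: int = 0) -> np.ndarray:
    """Tiny Lloyd's k-means; returns centroids of shape (n_clusters, dim)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = x[assign == c]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def pq_encode(features: np.ndarray, m: int, codebook_size: int, seed: int = 0) -> np.ndarray:
    """Split each feature vector into m sub-vectors and quantize each one separately.

    Returns an (n_items, m) integer array: one code per sub-space, i.e. an m-token ID.
    """
    n, d = features.shape
    assert d % m == 0, "feature dimension must be divisible by m"
    sub_dim = d // m
    codes = np.empty((n, m), dtype=np.int64)
    for j in range(m):
        sub = features[:, j * sub_dim:(j + 1) * sub_dim]
        centroids = kmeans(sub, codebook_size, seed=seed + j)
        dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        codes[:, j] = dists.argmin(axis=1)
    return codes

# Toy stand-in for item text embeddings: 200 items, 16 dimensions.
item_features = np.random.default_rng(0).normal(size=(200, 16))
pq_ids = pq_encode(item_features, m=4, codebook_size=8)
```

With the hyperparameters listed in this summary, the same call shape would use m=64 and codebook_size=256 on real item embeddings.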

Item Encoder (Input Processing)

Aggregates token embeddings into a single item vector

Model or implementation: Embedding Lookups + Aggregation Function
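
A minimal sketch of this lookup-and-aggregate step, assuming one embedding table per token position and mean pooling as the aggregation function. The summary does not say which aggregation is used, so mean pooling is an assumption, as are all names below.

```python
import numpy as np

def encode_items(semantic_ids: np.ndarray, token_tables: np.ndarray) -> np.ndarray:
    """Aggregate the token embeddings of each semantic ID into one item vector.

    semantic_ids: (n, m) integer array of token indices.
    token_tables: (m, M, d) array; token_tables[j] is the embedding table for position j.
    Returns an (n, d) array (mean pooling is an assumed aggregation).
    """
    m = token_tables.shape[0]
    positions = np.arange(m)
    token_vecs = token_tables[positions, semantic_ids]   # shape (n, m, d)
    return token_vecs.mean(axis=1)

# Toy sizes: m = 2 positions, M = 3 tokens per codebook, d = 4 dimensions.
toy_tables = np.random.default_rng(0).normal(size=(2, 3, 4))
toy_item_ids = np.array([[0, 2], [1, 1]])
item_vecs = encode_items(toy_item_ids, toy_tables)
```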

Sequence Encoder

Encodes user history into a context vector

Model or implementation: Transformer Decoder

Parallel Predictor (Generation)

Predicts probabilities for all token positions simultaneously

Model or implementation: Multi-Head MLP Projections
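
The idea of predicting every position at once can be sketched with one projection head per token position applied to the same sequence representation. For brevity each head here is a single linear layer rather than an MLP, and the weights are random; shapes and names are assumptions for illustration only.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def parallel_token_log_probs(seq_repr: np.ndarray,
                             head_weights: np.ndarray,
                             head_biases: np.ndarray) -> np.ndarray:
    """Compute log P^(j)(token | history) for all m positions in one shot.

    seq_repr:     (d,) sequence representation from the encoder.
    head_weights: (m, d, M) one projection matrix per position (linear head, a simplification).
    head_biases:  (m, M).
    Returns an (m, M) table of log-probabilities.
    """
    logits = np.einsum("d,jdk->jk", seq_repr, head_weights) + head_biases
    return log_softmax(logits, axis=-1)

rng = np.random.default_rng(0)
d, m, M = 8, 4, 16                      # toy sizes, not the paper's
head_log_probs = parallel_token_log_probs(
    rng.normal(size=d), rng.normal(size=(m, d, M)), np.zeros((m, M))
)
```

No head consumes another head's output, so all m distributions come from a single pass over the history.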

Graph Decoder (Generation)

Identifies valid semantic IDs from independent token predictions

Model or implementation: Graph Propagation Search
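
One way to picture graph propagation search: start from a small random beam of items, repeatedly expand the beam to its graph neighbors, score everything against the cached token log-probabilities, and keep the best b. The sketch below follows that reading with beam size b, q iterations, and k neighbors per node; the initialization, scoring rule, and all names are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def graph_decode(log_probs, semantic_ids, neighbors, beam_size, n_iters, top_k, seed=0):
    """Search a pre-computed item graph for high-scoring semantic IDs.

    log_probs:    (m, M) cached per-position log-probabilities.
    semantic_ids: (n, m) semantic ID of every item.
    neighbors:    (n, k) indices of each item's graph neighbors.
    Returns the indices of the top_k items found, best first.
    """
    rng = np.random.default_rng(seed)
    n, m = semantic_ids.shape
    positions = np.arange(m)

    def score(item_idx):
        # Sum of cached log-probabilities over the m token positions.
        return log_probs[positions, semantic_ids[item_idx]].sum(axis=1)

    beam = rng.choice(n, size=beam_size, replace=False)    # random start (assumption)
    for _ in range(n_iters):
        # Propagate: candidates are the current beam plus all of its neighbors.
        candidates = np.unique(np.concatenate([beam, neighbors[beam].ravel()]))
        order = np.argsort(-score(candidates))[:beam_size]
        beam = candidates[order]
    return beam[:top_k]

# Toy problem: 20 items on a ring, single-token IDs, scores peak at item 13.
n_items = 20
ring_ids = np.arange(n_items).reshape(-1, 1)
raw = -np.abs(np.arange(n_items) - 13).astype(float)
ring_log_probs = (raw - np.log(np.exp(raw).sum()))[None, :]
ring_neighbors = np.stack([(np.arange(n_items) - 1) % n_items,
                           (np.arange(n_items) + 1) % n_items], axis=1)
top_items = graph_decode(ring_log_probs, ring_ids, ring_neighbors,
                         beam_size=2, n_iters=20, top_k=1)
```

Each iteration scores at most b*(k+1) items, so the work depends on b, q, k, and m rather than on the size of the item pool.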

Novel Architectural Elements

Parallel prediction heads for all semantic ID tokens (removing autoregressive dependency)
Pre-computed similarity graph for decoding valid item combinations from independent token probabilities
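
As an illustration of the pre-computed graph, the sketch below links each item to the k items whose semantic IDs agree with it at the most token positions. This similarity measure is a stand-in chosen for simplicity; the summary does not state which similarity the paper uses, and the brute-force pairwise comparison is only suitable for toy sizes.

```python
import numpy as np

def build_knn_graph(semantic_ids: np.ndarray, k: int) -> np.ndarray:
    """Return an (n, k) array of neighbor indices for each item.

    Similarity (an assumption): number of positions where two IDs share a token.
    """
    # matches[a, b] = count of positions at which items a and b have the same token.
    matches = (semantic_ids[:, None, :] == semantic_ids[None, :, :]).sum(axis=-1)
    np.fill_diagonal(matches, -1)                      # exclude self-loops
    # Stable sort on negated counts: most-similar first, ties broken by item index.
    return np.argsort(-matches, axis=1, kind="stable")[:, :k]

toy_graph_ids = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 2],
])
toy_graph = build_knn_graph(toy_graph_ids, k=1)
```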

Modeling

Base Model: Transformer-based Sequence Encoder (e.g., SASRec style)

Training Method: Multi-Token Prediction (MTP) learning

Objective Functions:

Purpose: Maximize the probability of the ground-truth semantic ID tokens independently given the history.

Formally: L_MTP = - sum_{j=1}^m log P^{(j)}(c_{t,j} | s)
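
For a single training example, the objective above amounts to summing the negative log-probability of the ground-truth token at each of the m positions. A small NumPy sketch, with hypothetical names and a toy uniform predictor:

```python
import numpy as np

def mtp_loss(log_probs: np.ndarray, target_tokens: np.ndarray) -> float:
    """L_MTP = -sum_{j=1}^m log P^(j)(c_{t,j} | s) for one example.

    log_probs:     (m, M) per-position log-probabilities given the history s.
    target_tokens: (m,) ground-truth tokens c_{t,1..m} of the next item.
    """
    positions = np.arange(len(target_tokens))
    return float(-log_probs[positions, target_tokens].sum())

# With a uniform predictor over M = 4 tokens and m = 3 positions,
# the loss is 3 * log(4).
uniform_log_probs = np.full((3, 4), np.log(0.25))
uniform_loss = mtp_loss(uniform_log_probs, np.array([0, 3, 1]))
```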

Key Hyperparameters:

semantic_id_length: 64
codebook_size: 256
iterations_q: Not explicitly reported in the paper
beam_size_b: Not explicitly reported in the paper

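Compute: Inference time complexity O(M*m*d + b*q*k*m), independent of the total item count N. Here m is the semantic ID length, b the beam size, q the number of decoding iterations, and k the number of graph neighbors per item (M and d presumably denote the codebook size and the hidden dimension).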

Comparison to Prior Work

vs. TIGER: Parallel generation vs. autoregressive; long IDs (64) vs. short IDs (4); Graph decoding vs. Beam Search.
vs. VQ-Rec: Generative decoding independent of item count vs. retrieval requiring linear scan/ANN; MTP loss vs. Contrastive loss.
vs. HSTU [not cited in paper]: RPG focuses on the ID representation and decoding layer, whereas HSTU optimizes the attention mechanism itself.

Limitations

Sparse decoding space: predicting tokens independently makes mapping to valid items difficult without the graph constraint.
Graph construction overhead: requires building and storing a similarity graph of items, which must be updated as the item pool changes.
Reliance on OPQ: performance depends on the quality of the initial quantization and feature extraction.

Reproducibility

Code: https://github.com/facebookresearch/RPG_KDD2025

The paper relies on public benchmarks. Hyperparameters for graph construction (k neighbors) and decoding iterations (q) are discussed conceptually, but exact values for all experiments are not given in the main text.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation on public datasets.

Benchmarks:

Sports (Sequential Recommendation)
Beauty (Sequential Recommendation)
Toys (Sequential Recommendation)
Yelp (Sequential Recommendation)

Metrics:

NDCG@10
Recall@10
Statistical methodology: Not explicitly reported in the paper
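
For next-item prediction there is a single relevant item per test case, so both metrics reduce to simple functions of the target's rank. The sketch below shows that special case; the function names are illustrative.

```python
import math

def recall_at_k(ranked_items, target, k=10):
    """1.0 if the ground-truth item appears in the top k, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k with one relevant item: the ideal DCG is 1, so the score is
    1 / log2(rank + 2) for a 0-based rank inside the top k, else 0."""
    top = list(ranked_items[:k])
    if target not in top:
        return 0.0
    rank = top.index(target)
    return 1.0 / math.log2(rank + 2)

ranking = [42, 7, 19, 3, 88]
ndcg_first = ndcg_at_k(ranking, target=42)    # target ranked first
ndcg_third = ndcg_at_k(ranking, target=19)    # target ranked third
ndcg_miss = ndcg_at_k(ranking, target=1000)   # target not retrieved
```

Reported numbers are averages of these per-user values over the test set.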

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Average across datasets	NDCG@10	Not reported in the paper	Not reported in the paper	+12.6%

Main Takeaways

Scaling semantic ID length (up to 64) significantly improves performance compared to short IDs used in prior generative models.
Parallel generation combined with graph-constrained decoding achieves better efficiency than autoregressive beam search.
The graph-based decoding successfully mitigates the 'sparse decoding space' problem inherent in independent token prediction.

📚 Prerequisite Knowledge

Prerequisites

Generative Recommendation
Vector Quantization (Product Quantization)
Beam Search decoding
Sequential Recommendation

Key Terms

Semantic ID: A tuple of discrete tokens (integers) that represents an item, derived from quantizing its content features (e.g., text embeddings).

Product Quantization (PQ): A quantization technique that splits a vector into sub-vectors and quantizes each separately, resulting in a tuple of discrete codes.

Autoregressive Generation: Generating a sequence one token at a time, where each token depends on the previous ones.

Multi-token Prediction (MTP): A training objective where the model predicts multiple future tokens simultaneously and independently, rather than sequentially.

NDCG: Normalized Discounted Cumulative Gain, a measure of ranking quality that accounts for the position of relevant items in the recommendation list.

Beam Search: A search algorithm that explores a graph by expanding the most promising nodes in a limited set (beam).

Graph-Constrained Decoding: A decoding method proposed in this paper that restricts the search space to neighbors in a pre-computed graph of similar semantic IDs.