
Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs

Yu Liang, Zhongjin Zhang, Yuxuan Zhu, Kerui Zhang, Zhiluohan Guo, Wenhang Zhou, Zonqi Yang, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Jianxin Wang, Jiazhi Xia
Central South University
arXiv (2026)
Recommendation, P13N

📝 Paper Summary

Generative Recommendation, Semantic ID (SID) Learning, Vector Quantization for RecSys
ReSID improves generative recommendation in two steps: it learns item representations directly from structured features via masked auto-encoding, then quantizes them with globally aligned indices to reduce sequential uncertainty. Neither step relies on an LLM.
Core Problem
Existing Semantic ID (SID) pipelines encode items with foundation models optimized for semantic similarity, an objective that is misaligned with collaborative signals, and then apply generic quantization that ignores the sequential predictability required for autoregressive generation.
Why it matters:
  • Misalignment: Items that co-occur (e.g., snacks and balloons) may be semantically distant, confusing models that rely purely on semantic embeddings.
  • Inefficiency: Injecting collaborative signals into large LLMs is computationally expensive.
  • Unpredictability: Standard quantization (like RQ-VAE) creates codes with high prefix-conditional uncertainty, making the downstream task of autoregressive generation much harder.
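One way to make "prefix-conditional uncertainty" concrete is the conditional entropy of a token given the tokens that precede it. The sketch below is an illustration under that assumption, not the metric used in the paper; the function name `prefix_conditional_entropy` and the toy codes are hypothetical.

```python
import math
from collections import Counter, defaultdict


def prefix_conditional_entropy(codes, level):
    """Entropy in bits of the token at `level`, given the tokens before it.

    `codes` is a list of equal-length tuples, one per observed item.
    """
    by_prefix = defaultdict(Counter)
    for code in codes:
        by_prefix[tuple(code[:level])][code[level]] += 1

    total = len(codes)
    entropy = 0.0
    for counts in by_prefix.values():
        n = sum(counts.values())
        for c in counts.values():
            p = c / n
            # Weight each prefix by how often it occurs.
            entropy -= (n / total) * p * math.log2(p)
    return entropy


# Toy codes (hypothetical): in the first set the prefix fully determines
# the next token, in the second it leaves two equally likely options.
predictable = [(0, 0), (0, 0), (1, 1), (1, 1)]
uncertain = [(0, 0), (0, 1), (1, 0), (1, 1)]

h_predictable = prefix_conditional_entropy(predictable, level=1)
h_uncertain = prefix_conditional_entropy(uncertain, level=1)
```

Under this reading, a lower value means the autoregressive decoder has fewer plausible continuations to choose from at that level.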
Concrete Example: In hierarchical encoding, items with different semantics might share the same second-level token '1' (e.g., codes (2,1,5) and (9,1,7)). Ideally, token '1' should mean the same thing everywhere, but local indexing makes it context-dependent, confusing the generative model during sequential decoding.
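The context dependence in this example can be checked mechanically. The sketch below is a toy illustration, not code from the paper: it reuses the codes (2,1,5) and (9,1,7) from the example, while the other codes, the attribute labels ("budget", "premium"), and the helper `meanings_per_token` are hypothetical.

```python
from collections import defaultdict


def meanings_per_token(items, level):
    """Map each token at `level` to the set of attribute labels it covers."""
    seen = defaultdict(set)
    for code, label in items:
        seen[code[level]].add(label)
    return dict(seen)


# Each entry pairs a SID code with the attribute its second-level token
# is supposed to capture (labels are made up for illustration).
local = [
    ((2, 1, 5), "budget"),
    ((2, 0, 3), "premium"),
    ((9, 1, 7), "premium"),  # same token '1', different meaning
    ((9, 0, 2), "budget"),
]
aligned = [
    ((2, 1, 5), "budget"),
    ((2, 0, 3), "premium"),
    ((9, 0, 7), "premium"),  # re-indexed so '0' always means premium
    ((9, 1, 2), "budget"),
]

# level=1 is the second position of the code (0-based indexing).
local_meanings = meanings_per_token(local, level=1)
aligned_meanings = meanings_per_token(aligned, level=1)
```

With local indexing, token '1' covers both labels; after re-indexing, each second-level token maps to exactly one label regardless of the first-level prefix.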
Key Novelty
ReSID (Recommendation-Native Semantic ID)
  • Replaces LLM-based embeddings with Field-Aware Masked Auto-Encoding (FAMAE), which learns item representations by predicting masked structured features (such as category and price) conditioned on user history.
  • Introduces Globally Aligned Orthogonal Quantization (GAOQ), which forces code indices at each level to align with global reference directions, so that an index carries the same semantic meaning regardless of the prefix path.
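The idea of aligning indices with global reference directions can be sketched as an assignment problem. The code below is a minimal illustration under stated assumptions, not the GAOQ algorithm from the paper: it assumes child clusters are described by their offset from the parent centroid, picks the one-to-one index assignment that maximizes total cosine similarity, and uses brute force over permutations. The names `align_indices` and `reference_dirs` and all vectors are hypothetical.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_indices(child_offsets, reference_dirs):
    """Return a list where entry i is the global index given to child i.

    Tries every one-to-one assignment and keeps the one with the highest
    total cosine similarity, so it only suits very small codebooks.
    """
    k = len(child_offsets)
    best = max(
        permutations(range(len(reference_dirs)), k),
        key=lambda perm: sum(
            cosine(child_offsets[i], reference_dirs[j])
            for i, j in enumerate(perm)
        ),
    )
    return list(best)


# Orthogonal reference directions shared by every prefix (toy 2-D case).
reference_dirs = [(1.0, 0.0), (0.0, 1.0)]

# Child offsets under two different parents, listed in arbitrary local order.
children_under_a = [(0.9, 0.1), (0.2, 0.8)]
children_under_b = [(0.1, 0.7), (0.8, -0.1)]

codes_a = align_indices(children_under_a, reference_dirs)
codes_b = align_indices(children_under_b, reference_dirs)
```

In this toy case the children under the second parent arrive in the opposite local order, and the alignment swaps their indices so that index 0 points along the first reference direction under both parents. A practical version would replace the permutation search with a linear assignment solver.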
Architecture
Figure 2: The ReSID framework pipeline, showing its two main stages, FAMAE and GAOQ.
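For the first stage, the data-preparation step of field-level masking might look like the sketch below. This is an assumption-laden illustration, not the paper's implementation: the `MASK` placeholder, the masking probability, the field names, and the helper `mask_fields` are all hypothetical, and the model that predicts the masked fields from user history is omitted.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a hidden field value


def mask_fields(item, mask_prob, rng):
    """Hide whole fields of one item; return (masked_item, targets).

    `targets` holds the original values of the hidden fields, which a
    model would be trained to predict.
    """
    masked, targets = {}, {}
    for field, value in item.items():
        if rng.random() < mask_prob:
            masked[field] = MASK
            targets[field] = value
        else:
            masked[field] = value

    if not targets:
        # Guarantee at least one prediction target per item.
        field = rng.choice(sorted(item))
        masked[field] = MASK
        targets[field] = item[field]
    return masked, targets


# A toy item described by structured features (values are made up).
item = {"category": "snacks", "price_bucket": "low", "brand": "acme"}

rng = random.Random(0)  # fixed seed so the example is reproducible
masked_item, targets = mask_fields(item, mask_prob=0.5, rng=rng)
```

Masking at the granularity of whole fields, rather than individual embedding dimensions, is what makes the objective "field-aware" in this sketch.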
Evaluation Highlights
  • Outperforms strong sequential and SID-based generative baselines by an average of over 10% across ten datasets.
  • Reduces tokenization costs by up to 122x compared to LLM-based pipelines on million-scale datasets.
  • Achieves superior performance without using any pre-trained language models or pixel-based encoders.
Breakthrough Assessment
8/10
Significantly challenges the trend of using heavy LLMs for ID creation in RecSys. Demonstrates that domain-native features and principled quantization are far more efficient and effective.