
From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval

Jian Jia, Jingtong Gao, Ben Xue, Junhao Wang, Qingpeng Cai, Quan Chen, Xiangyu Zhao, Peng Jiang, Kun Gai
City University of Hong Kong
arXiv (2025)
MM Recommendation Pretraining

📝 Paper Summary

Multimodal Representation Learning, Vector Quantization
This survey provides a unified framework and taxonomy for discrete tokenizers, analyzing their mechanisms, evolution from vanilla quantization to lookup-free methods, and applications across multimodal tasks.
Core Problem
Despite the critical role of discrete tokenizers in enabling LLMs to process continuous modalities (video, audio, image), no comprehensive survey exists to systematize their design principles, applications, and evolution.
Why it matters:
  • Suboptimal tokenization directly degrades performance in downstream generation and comprehension tasks by failing to preserve semantic fidelity
  • Modern LLMs require discrete inputs, making the tokenizer the essential bottleneck for unifying diverse modalities like video and audio into a single model
  • The field lacks a unified perspective connecting traditional vector quantization with modern lookup-free and semantic tokenization techniques
Concrete Example: In video synthesis, a poor tokenizer might fail to compress temporal redundancy, leading to incoherent frames, whereas an optimized tokenizer enables autoregressive models to generate smooth, contextually relevant video sequences.
Key Novelty
Comprehensive Taxonomy of Discrete Tokenizers
  • Deconstructs the tokenization pipeline into three universal steps: Encoding (dimensionality mapping), Quantization (discretization), and Supervision (reconstruction/consistency)
  • Classifies quantization methods into evolutionary stages: from vanilla vector quantization (VQ), through structured variants such as residual (RQ) and product (PQ) quantization, to recent lookup-free quantization (LFQ)
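The quantization stages named above can be illustrated with a minimal sketch. It is not taken from the survey: the function names, the toy codebooks, and the bit ordering used for LFQ are assumptions chosen for illustration, and the encoder and supervision steps are omitted, so each function receives an already encoded latent vector `z`.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vanilla_vq(z: Vector, codebook: Sequence[Vector]) -> int:
    """Vanilla VQ: return the index of the nearest codebook entry."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(z, codebook[k]))


def residual_vq(z: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Residual quantization: each stage quantizes what earlier stages left over."""
    residual = list(z)
    indices = []
    for codebook in codebooks:
        k = vanilla_vq(residual, codebook)
        indices.append(k)
        # Subtract the chosen entry so the next stage sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices


def product_vq(z: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Product quantization: split z into equal chunks, one codebook per chunk."""
    size = len(z) // len(codebooks)
    return [vanilla_vq(z[i * size:(i + 1) * size], codebook)
            for i, codebook in enumerate(codebooks)]


def lookup_free(z: Vector) -> int:
    """Lookup-free quantization (assumed sign-based form): no codebook search.

    Each dimension contributes one bit (1 if positive), and the token id
    is the integer formed by those bits, with dimension i as bit i.
    """
    return sum(1 << i for i, x in enumerate(z) if x > 0)


# Toy codebooks (hypothetical values, for illustration only).
coarse = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
fine = [(0.0, 0.0), (-0.1, 0.2), (0.1, -0.2)]

print(vanilla_vq((0.9, 1.2), coarse))                 # 1
print(residual_vq((0.9, 1.2), [coarse, fine]))        # [1, 1]
print(product_vq((0.9, 1.2, -0.8, 0.1),
                 [[(0.0, 0.0), (1.0, 1.0)],
                  [(-1.0, 0.0), (1.0, 0.0)]]))        # [1, 0]
print(lookup_free((0.9, -0.3, 0.2)))                  # 5
```

Vanilla VQ emits one index per vector, RQ and PQ emit several indices drawn from smaller codebooks, and the sign-based LFQ variant replaces the nearest-neighbor search with a per-dimension threshold, so its vocabulary size is 2 raised to the latent dimension.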
Evaluation Highlights
  • Not applicable: this is a survey paper and reports no new experimental results
Breakthrough Assessment
8/10
Although this is a survey rather than a new method, it fills a significant gap by organizing a fragmented field that is crucial for multimodal LLMs. Its taxonomy, which connects classical VQ to modern semantic tokenization, is the most valuable contribution.