QARM V2: Quantitative Alignment Multi-Modal Recommendation for Reasoning User Sequence Modeling

📝 Paper Summary

Industrial Recommendation Systems, Multi-modal User Sequence Modeling

QARM V2 aligns LLMs with recommendation goals by filtering training data via reasoning models and generating hybrid quantized Semantic IDs that combine data-adaptive clustering with collision-free scalar quantization.

Core Problem

Traditional ID-based recommendation systems lack semantic understanding and generalization, while direct use of LLM embeddings suffers from misalignment with business objectives (Representation Unmatch) and difficulty in end-to-end optimization (Representation Unlearning).

Why it matters:

Standard ID embeddings discard knowledge when items go offline (Knowledge Isolation) and fail to capture fine-grained semantics beyond coarse tags
Directly using pre-trained LLM embeddings fails because LLM objectives (captioning) differ from RecSys goals (click prediction); e.g., visually similar items may have different usage scenarios
Freezing LLM embeddings prevents RecSys from adapting to dynamic user preferences, while fine-tuning billion-parameter LLMs end-to-end in real-time is computationally prohibitive

Concrete Example: A standard retrieval model might link 'soy sauce' and 'laundry detergent' due to exposure bias (both are popular), or 'toothpaste' and 'ointment' due to similar tube packaging. QARM V2 uses a reasoning LLM to reject these illogical pairs, ensuring embeddings reflect true underlying usage relationships.

Key Novelty

Reasoning-Aligned Embedding & Hybrid Res-KmeansFSQ Quantization

Uses a 'Reasoning Item Alignment' mechanism where a reasoning LLM filters noisy training pairs (e.g., popular but unrelated items) to ensure embeddings capture true business logic
Introduces 'Res-KmeansFSQ', a hybrid quantization method that uses K-means for coarse categorization (first 2 layers) and Finite Scalar Quantization (FSQ) for fine-grained details (last layer), preventing codebook collisions in long-tail data

Architecture

[Figure: the 3-segment LLM training strategy for generating embeddings while maintaining text generation capabilities.]

Evaluation Highlights

Rejecting 10%+ of Item2Item pairs and 70%+ of User2Item pairs during data construction significantly reduces noise from exposure bias
Deployed on Kuaishou's platform serving 400 million daily active users across shopping, advertising, and live-streaming scenarios
Reduces codebook collisions in the Shopping scenario compared to naive Res-Kmeans (where >30% of IDs had multiple candidates)

Breakthrough Assessment

8/10

Strong industrial application of LLMs to RecSys. The hybrid quantization (K-means + FSQ) and reasoning-based data cleaning are practical, high-impact innovations for billion-scale systems.

⚙️ Technical Details

Problem Definition

Setting: Lifelong user sequence modeling for industrial recommendation, split into General Search Unit (GSU) and Exact Search Unit (ESU)

Inputs: User historical interaction sequence {i_1, ..., i_T} and candidate item i_target

Outputs: Probability of user interaction (CTR/CVR)

Pipeline Flow

Data Construction: Reasoning LLM filters raw retrieval pairs -> Cleaned Pairs & QA Data
Embedding Generation: Multi-modal Item Data -> Fine-tuned LLM (3-segment) -> Continuous Embedding
Quantization: Continuous Embedding -> Res-Kmeans (Layers 1-2) -> FSQ (Layer 3) -> Semantic IDs
GSU Retrieval: Uses Continuous Embeddings to fetch top-K subsequences
ESU Ranking: Uses Semantic IDs as features -> Transformer/MoE -> Click Prediction
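
The GSU retrieval step in the flow above can be illustrated with a minimal sketch. It assumes cosine similarity as the relevance score and returns the selected behaviors in their original order; the summary specifies neither choice, and the names `cosine` and `gsu_top_k` and the toy 2-D embeddings are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def gsu_top_k(history, target_emb, k):
    """Return the k history items most similar to the target, oldest first.

    history: list of (item_id, embedding) pairs, oldest first.
    """
    scored = [(cosine(emb, target_emb), pos) for pos, (_, emb) in enumerate(history)]
    # Take the k best scores, then restore chronological order.
    keep = sorted(pos for _, pos in sorted(scored, reverse=True)[:k])
    return [history[pos][0] for pos in keep]

# Toy history with made-up 2-D embeddings (assumption, for illustration only).
history = [
    ("tent", [0.9, 0.1]),
    ("soy_sauce", [0.0, 1.0]),
    ("sleeping_bag", [0.8, 0.2]),
    ("vinegar", [0.1, 0.9]),
]
subsequence = gsu_top_k(history, target_emb=[1.0, 0.0], k=2)
print(subsequence)
```

In a production setting the exhaustive scan would presumably be replaced by an approximate nearest-neighbor index; the sketch only shows the selection logic.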

System Modules

Reasoning Filter

Filter noisy item pairs generated by statistical retrieval models

Model or implementation: Qwen3-0.6B (for Item2Item), Qwen3-8B (for User2Item)
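
As a rough sketch of how such a filter could sit in a data pipeline, the snippet below splits candidate pairs using a `judge` callable. The callable stands in for the reasoning LLM's verdict; `toy_judge` and the `USAGE` table are hypothetical placeholders, not the paper's prompt or model.

```python
def filter_pairs(pairs, judge):
    """Split candidate pairs into kept and rejected using a judge callable.

    judge(a, b) -> bool is a stand-in for the reasoning LLM's verdict.
    """
    kept, rejected = [], []
    for a, b in pairs:
        (kept if judge(a, b) else rejected).append((a, b))
    return kept, rejected

# Hypothetical stand-in for the LLM: two items are related if they share a usage scenario.
USAGE = {
    "soy sauce": "cooking", "vinegar": "cooking",
    "laundry detergent": "cleaning",
    "toothpaste": "oral care", "toothbrush": "oral care",
    "ointment": "skin care",
}

def toy_judge(a, b):
    return USAGE[a] == USAGE[b]

candidates = [
    ("soy sauce", "laundry detergent"),   # co-popular, unrelated usage
    ("toothpaste", "ointment"),           # similar packaging, unrelated usage
    ("soy sauce", "vinegar"),
    ("toothpaste", "toothbrush"),
]
kept, rejected = filter_pairs(candidates, toy_judge)
rejection_rate = len(rejected) / len(candidates)
print(kept, rejection_rate)
```

Keeping the judge behind a plain callable means a smaller or larger model can be swapped in without touching the pipeline, which matches the use of different model sizes for Item2Item and User2Item pairs.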

Embedding Generator (Representation Learning)

Generate business-aligned dense embeddings for items

Model or implementation: Fine-tuned Multi-modal LLM (e.g., Qwen/Gemini based)

Res-KmeansFSQ Quantizer (Representation Learning)

Convert continuous embeddings into discrete Semantic IDs (SIDs)

Model or implementation: Hybrid Quantizer (2-layer K-means + 1-layer FSQ)
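
A minimal sketch of the hybrid idea, under simplifying assumptions: the codebooks are tiny hand-written 2-D examples rather than trained K-means centroids, FSQ is applied directly to the final residual without a learned projection, and each dimension is bounded with tanh and split into equal bins. All function names are hypothetical.

```python
import math

def nearest(codebook, vec):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def fsq_code(vec, levels=2):
    """Bound each dimension with tanh, bin it into `levels` cells, pack into one integer."""
    code = 0
    for v in vec:
        unit = (math.tanh(v) + 1) / 2                  # map to (0, 1)
        bin_idx = min(int(unit * levels), levels - 1)  # fixed grid cell, no learned codebook
        code = code * levels + bin_idx
    return code

def res_kmeans_fsq(vec, codebooks):
    """Residual nearest-centroid layers, then FSQ on the remaining residual."""
    sid, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    sid.append(fsq_code(residual))
    return tuple(sid)

# Toy codebooks (assumption): two layers with two centroids each.
codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, -1.0]],
]
sid_a = res_kmeans_fsq([5.2, 2.9], codebooks)
sid_b = res_kmeans_fsq([4.9, 3.2], codebooks)
print(sid_a, sid_b)
```

The two toy vectors fall into the same centroid at both K-means layers, so a pure residual K-means code would collide at that depth; the grid-based last layer separates them by the sign pattern of their residuals.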

ESU Ranking Model

Predict user engagement based on historical SIDs

Model or implementation: Transformer with MoE prediction head

Novel Architectural Elements

Hybrid Res-KmeansFSQ quantization pipeline: combining data-dependent clustering (K-means) for head/torso items with data-independent scalar quantization (FSQ) for tail items to minimize collision
Three-segment LLM input masking: enabling simultaneous optimization of embedding compression (via <EMB> token) and next-token prediction (QA) in a single forward pass
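
One way such a mask could be built is sketched below. The rule set is an assumption, not taken from the paper: attention is causal, and QA tokens may not attend to raw item tokens, so the answer has to be produced from what the `<EMB>` position has compressed. The segment labels `"item"`, `"emb"`, and `"qa"` are hypothetical.

```python
def three_segment_mask(segments):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed rules: causal attention everywhere, except that "qa" tokens
    cannot see raw "item" tokens and must read the item through "emb".
    """
    n = len(segments)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):                 # causal: only keys at or before the query
            if segments[q] == "qa" and segments[k] == "item":
                continue                       # block the direct item-to-QA path
            mask[q][k] = True
    return mask

segments = ["item", "item", "item", "emb", "qa", "qa"]
mask = three_segment_mask(segments)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumed rules, a single forward pass yields both the `<EMB>` hidden state for the contrastive loss and the QA logits for next-token prediction.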

Modeling

Base Model: Qwen3-0.6B/8B (Filtering), Qwen2.5-VL-72B (Data Gen), GPT-style LLM (Embedding)

Training Method: Supervised Fine-Tuning + Contrastive Learning

Objective Functions:

Purpose: Optimize embedding similarity for related item pairs.

Formally: Contrastive loss on <EMB> tokens using Gradient Cache

Purpose: Maintain LLM's reasoning and understanding ability.

Formally: Next Token Prediction (NTP) loss on QA segment

Purpose: Optimize downstream ranking performance.

Formally: Multi-task binary cross-entropy loss for CTR/CVR using learnable SID embeddings
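
For orientation, the objectives listed above could be written as follows. This is a sketch under assumptions: the in-batch InfoNCE form, the temperature \tau, the weight \lambda, and the way the first two losses are combined are not stated in this summary. Here e_i and e_i^{+} are the <EMB> representations of a filtered pair, B is the batch size, \mathcal{T} is the set of ranking tasks (CTR, CVR), y_t is the label for task t, and \hat{y}_t is the prediction.

```latex
\mathcal{L}_{\text{con}}
  = -\frac{1}{B}\sum_{i=1}^{B}
    \log \frac{\exp\left(\operatorname{sim}(e_i, e_i^{+})/\tau\right)}
              {\sum_{j=1}^{B}\exp\left(\operatorname{sim}(e_i, e_j^{+})/\tau\right)}

\mathcal{L}_{\text{LLM}} = \mathcal{L}_{\text{con}} + \lambda\,\mathcal{L}_{\text{NTP}}

\mathcal{L}_{\text{rank}}
  = -\sum_{t \in \mathcal{T}}
    \left[\, y_t \log \hat{y}_t + (1 - y_t)\log(1 - \hat{y}_t) \,\right]
```

The ranking loss is written separately because it trains the downstream ESU model on Semantic IDs rather than the LLM itself.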

Training Data:

Item2Item pairs from Swing (Exploitation)
User2Item pairs from Two-Tower (Exploration)
Filtered by Reasoning LLMs
QA pairs generated by Gemini/Qwen2.5-VL-72B

Key Hyperparameters:

quantization_layers: 3
FSQ_levels_L: 2
K_means_clusters: 8192 (example from text)
FSQ_projection_dim: 13
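
A quick arithmetic check on these values: with 2 levels per dimension and 13 dimensions, the FSQ grid has as many cells as one 8192-way K-means layer. The total code space below assumes both K-means layers use 8192 clusters, which the list gives only as an example.

```python
fsq_levels = 2          # FSQ_levels_L from the list above
fsq_dim = 13            # FSQ_projection_dim from the list above
kmeans_clusters = 8192  # K_means_clusters from the list above (assumed for both layers)

fsq_codes = fsq_levels ** fsq_dim             # distinct grid cells in the FSQ layer
sid_space = kmeans_clusters ** 2 * fsq_codes  # two K-means layers, then one FSQ layer

print(fsq_codes, sid_space)
```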

Compute: Not reported in the paper

Comparison to Prior Work

vs. QARM (V1): QARM V2 adds Reasoning Item Alignment (data filtering) and replaces the last quantization layer with FSQ to reduce collisions
vs. CLIP: QARM V2 fine-tunes the LLM with RecSys-specific 'Reasoning' objectives rather than generic image-text matching
vs. TWIN: QARM V2 uses multi-modal Semantic IDs instead of pure ID embeddings, enabling better generalization to new items

Limitations

Dependency on large models for data construction (Qwen3 reasoning filters, plus Qwen2.5-VL-72B or Gemini for QA generation) creates high offline compute cost
Quantitative improvements (AUC/revenue) are claimed, but precise numbers are not included in this summary
FSQ requires predefined grid dimensions which may need manual tuning compared to fully learnable codebooks

Reproducibility

No replication artifacts mentioned in the paper. The system is deployed at Kuaishou. Specific datasets and code are not provided.

📊 Experiments & Results

Evaluation Setup

Industrial recommendation scenario (Shopping, Advertising, Live-streaming) at Kuaishou

Benchmarks:

Kuaishou Production Traffic (Online A/B Testing)

Metrics:

CTR (Click-Through Rate)
CVR (Conversion Rate)
Code Collision Rate

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (no filtering)	This Paper	Δ (percentage points)
Data Construction Pipeline	Item2Item rejection rate (%)	0	10+	+10 or more
Data Construction Pipeline	User2Item rejection rate (%)	0	70+	+70 or more

Experiment Figures

[Figure: visual comparison of K-means centroids vs. FSQ grid in a 2D space.]

Main Takeaways

Naive Res-Kmeans quantization leads to >30% code collision in shopping scenarios due to long-tail item distributions.
Hybrid quantization (Res-Kmeans + FSQ) effectively balances distribution fitting (via K-means) and distinctness (via FSQ), reducing collisions.
Filtering retrieval data with reasoning LLMs is crucial: 70% of User2Item pairs were rejected as semantically unrelated, highlighting the gap between ID-based retrieval and true semantic relevance.
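
The collision figure above can be made concrete with a small sketch. It assumes the rate is defined as the share of distinct Semantic IDs assigned to more than one item, which is one reading of "IDs with multiple candidates"; the function name and toy mapping are hypothetical.

```python
from collections import Counter

def collision_rate(item_to_sid):
    """Fraction of distinct Semantic IDs that are shared by more than one item."""
    counts = Counter(item_to_sid.values())
    if not counts:
        return 0.0
    collided = sum(1 for n in counts.values() if n > 1)
    return collided / len(counts)

# Toy mapping (assumption): item_a and item_b share one Semantic ID.
item_to_sid = {
    "item_a": (1, 1, 2),
    "item_b": (1, 1, 2),
    "item_c": (1, 1, 1),
    "item_d": (0, 3, 5),
}
rate = collision_rate(item_to_sid)
print(round(rate, 3))
```

An item-level variant (the share of items whose ID is not unique) would give a different number on the same data, so the definition should be fixed before comparing quantizers.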

📚 Prerequisite Knowledge

Prerequisites

Industrial Recommendation Systems (Two-stage: Retrieval/Ranking)
Vector Quantization (K-means, Residual Quantization)
Large Language Models (Transformers, Next Token Prediction)

Key Terms

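GSU: General Search Unit, the first stage of lifelong user sequence modeling, which retrieves the top-K behaviors most relevant to the candidate item from a user's long interaction history

ESU: Exact Search Unit, the second stage, which models the retrieved subsequence together with the candidate item using a more expensive model to predict engagement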

Semantic IDs: Discrete codes (tokens) representing items, derived from quantizing dense embeddings, allowing LLM-like sequence modeling for recommendations

Res-Kmeans: Residual K-means—a quantization method that approximates vectors by recursively clustering residuals (errors) from the previous step

FSQ: Finite Scalar Quantization, a method that projects a vector to a few dimensions, bounds each dimension, and rounds it to a small fixed set of levels; the codebook is an implicit grid, so there is no learned codebook that can collapse

Swing: An item-to-item collaborative filtering algorithm that calculates similarity based on the overlap of users who interacted with both items (User-Item-User paths)
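
A minimal sketch of the Swing score in its commonly cited form: each pair of users who both interacted with items i and j contributes 1 / (alpha + overlap), where overlap is the number of items the two users share. The smoothing constant `alpha`, the use of unordered user pairs, and the toy data are assumptions for illustration.

```python
from itertools import combinations

def swing_similarity(user_items, i, j, alpha=1.0):
    """Swing score for items i and j from a dict of user -> set of clicked items."""
    common_users = [u for u, items in user_items.items() if i in items and j in items]
    score = 0.0
    for u, v in combinations(common_users, 2):
        overlap = len(user_items[u] & user_items[v])
        score += 1.0 / (alpha + overlap)   # user pairs with broad overlap count for less
    return score

user_items = {
    "u1": {"tent", "sleeping_bag"},
    "u2": {"tent", "sleeping_bag", "stove"},
    "u3": {"tent", "soy_sauce"},
}
score = swing_similarity(user_items, "tent", "sleeping_bag")
print(round(score, 3))
```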

Exposure Bias: The tendency for recommendation systems to suggest popular items simply because they are shown more often, not because they are most relevant

NTP: Next Token Prediction—the standard training objective for generative language models

MoE: Mixture of Experts—a neural network architecture that activates only a subset of sub-networks (experts) for each input to save compute

SIDs: Semantic IDs—see Semantic IDs definition above