Unleashing the Native Recommendation Potential: LLM-Based Generative Recommendation via Structured Term Identifiers

📝 Paper Summary

Generative Recommendation, LLM-based Recommendation

GRLM replaces numerical IDs with 'Term IDs'—structured sets of semantic keywords derived from LLM vocabulary—and jointly fine-tunes the model on term generation and recommendation to mitigate hallucinations and improve transferability.

Core Problem

Existing generative recommenders struggle with item identification: text-based methods suffer from hallucinations and ambiguity, while Semantic IDs (discrete codes) lack semantic meaning and require costly vocabulary expansion.

Why it matters:

Hallucinations in text-based methods lead to recommending non-existent items, breaking system reliability
Semantic IDs (SIDs) create a semantic gap with the LLM's pre-trained knowledge, hindering cross-domain generalization
Current methods require expensive vocabulary resizing and alignment training to bridge the gap between IDs and natural language

Concrete Example: When identifiers are generated for each item in isolation, identical features can receive inconsistent labels (e.g., 'Cell-Phone' vs. 'Mobile-Phone'), and distinct models can collapse into one generic label (e.g., 'iPhone' assigned to several different versions). GRLM conditions generation on neighboring items to enforce consistent yet discriminative terms.

Key Novelty

Term IDs (TIDs) via Context-Aware Generation

Represent items not as arbitrary numbers or raw titles, but as a fixed-length sequence of standardized, semantically rich keywords (Term IDs) derived from the LLM's native vocabulary
Use 'Context-aware Term Generation' where the LLM sees an item's neighbors to ensure terms are consistent across similar items (resolving synonyms) but distinct enough to separate specific products
Dual-track grounding (Direct + Structural Mapping) ensures generated text maps validly to real items even if the generation isn't an exact string match

Architecture

The three-stage pipeline of GRLM: Term Generation, Fine-tuning, and Grounding.

Evaluation Highlights

+30.2% relative Recall@5 improvement on the Sports dataset over the strongest baseline (OneRec-Think)
Cross-domain Recall@K improves by >50% on average (e.g., Sports→Clothing) without specific alignment modules, leveraging natural language transfer
Achieves >99% Valid Rate and Direct Hit Rate, effectively eliminating hallucination issues common in text-based generative recommendation

Breakthrough Assessment

8/10

Significantly outperforms SOTA by rethinking item tokenization. The shift from SIDs to semantic Term IDs solves the vocabulary gap and hallucination issues simultaneously, with strong scaling properties.

⚙️ Technical Details

Problem Definition

Setting: Generative Sequential Recommendation

Inputs: User historical behavior sequence S = {i_1, i_2, ..., i_n}

Outputs: The Term IDs corresponding to the next item i_{n+1}

Pipeline Flow

Context-aware Term Generation (Offline/Preprocessing)
Integrative Instruction Fine-tuning (Training)
Elastic Identifier Grounding (Inference)

System Modules

Context-aware Term Generation (CTG)

Convert item metadata into standardized Term IDs using neighbor context

Model or implementation: Qwen3-4B-Instruct (Frozen)
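
The context-aware step can be illustrated with a minimal sketch: retrieve an item's nearest neighbors by embedding similarity, then place their metadata in the prompt next to the target item. The function names, the prompt wording, the neighbor count `k`, and the toy two-dimensional embeddings are all assumptions made for illustration; the paper's actual prompt and retrieval procedure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(target_id, embeddings, k=2):
    """Return the ids of the k items most similar to the target item."""
    target = embeddings[target_id]
    scored = [
        (cosine(target, vec), item_id)
        for item_id, vec in embeddings.items()
        if item_id != target_id
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

def build_ctg_prompt(target_id, metadata, embeddings, k=2, num_terms=5):
    """Assemble a hypothetical prompt that shows the LLM the target and its neighbors."""
    neighbors = nearest_neighbors(target_id, embeddings, k)
    lines = [
        f"Describe the target item with exactly {num_terms} standardized keywords.",
        "Reuse neighbor wording for shared features; add terms that set the target apart.",
        "Neighbors:",
    ]
    lines += [f"- {metadata[n]}" for n in neighbors]
    lines.append(f"Target: {metadata[target_id]}")
    return "\n".join(lines)

# Toy catalog (hypothetical data, not from the paper).
embeddings = {
    "phone_a": [1.0, 0.0],
    "phone_b": [0.9, 0.1],
    "yoga_mat": [0.0, 1.0],
    "phone_case": [0.8, 0.3],
}
metadata = {
    "phone_a": "iPhone 15 Pro, 256 GB, titanium",
    "phone_b": "iPhone 15, 128 GB, aluminum",
    "yoga_mat": "Non-slip yoga mat, 6 mm",
    "phone_case": "Silicone case for iPhone 15 Pro",
}

prompt = build_ctg_prompt("phone_a", metadata, embeddings)
print(prompt)
```

Showing the neighbors is what lets the model reuse one label for a shared feature while still choosing a term that separates closely related products.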

Integrative Instruction Fine-tuning (IIFT)

Jointly train LLM on Term Internalization and Recommendation

Model or implementation: Qwen3-4B-Instruct (Fine-tuned)
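
As a rough sketch of what joint training data could look like, the snippet below builds one instruction sample per item for term internalization and one per user sequence for next-item prediction, then mixes them into a single training set. The instruction wording, the comma-separated TID format, and all helper names are assumptions for illustration, not the paper's templates.

```python
def format_tid(terms):
    """Serialize a Term ID as a comma-separated keyword string (assumed format)."""
    return ", ".join(terms)

def internalization_sample(item_metadata, terms):
    """Task 1: map raw item metadata to its Term ID."""
    return {
        "instruction": "Generate the Term ID for this item.",
        "input": item_metadata,
        "output": format_tid(terms),
    }

def recommendation_sample(history_terms, next_terms):
    """Task 2: predict the next item's Term ID from the user's history."""
    history = "\n".join(
        f"{pos}. {format_tid(terms)}" for pos, terms in enumerate(history_terms, start=1)
    )
    return {
        "instruction": "Given the interaction history, predict the Term ID of the next item.",
        "input": history,
        "output": format_tid(next_terms),
    }

def build_training_set(catalog, sequences):
    """Mix both tasks; catalog maps item_id -> (metadata, terms)."""
    samples = [internalization_sample(meta, terms) for meta, terms in catalog.values()]
    for seq in sequences:
        if len(seq) < 2:
            continue  # need at least one history item and one target
        history = [catalog[item_id][1] for item_id in seq[:-1]]
        samples.append(recommendation_sample(history, catalog[seq[-1]][1]))
    return samples

# Toy data (hypothetical).
catalog = {
    "i1": ("Trail running shoes, size 42", ["footwear", "running", "trail", "lightweight", "mens"]),
    "i2": ("Running socks, 3-pack", ["socks", "running", "cushioned", "breathable", "multipack"]),
    "i3": ("Hydration vest, 5 L", ["vest", "running", "hydration", "5-liter", "unisex"]),
}
sequences = [["i1", "i2", "i3"], ["i2"]]
samples = build_training_set(catalog, sequences)
```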

Elastic Identifier Grounding (EIG)

Map generated text sequence to a specific item in the candidate library

Model or implementation: Algorithmic matching (Non-parametric)
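
A dual-track lookup of this kind could be sketched as follows: first try an exact match of the generated term sequence, and if that fails, score each library item by how many terms it shares with the generation. The plain set-overlap score, the comma-separated parsing, and the function names are assumptions; the paper's structural score may be defined differently.

```python
def parse_terms(generated):
    """Split a generated string into normalized terms (assumed comma-separated)."""
    return [t.strip().lower() for t in generated.split(",") if t.strip()]

def build_index(library):
    """Exact-match index: normalized term tuple -> item_id."""
    return {tuple(t.lower() for t in terms): item_id for item_id, terms in library.items()}

def ground(generated, library, index=None):
    """Return (item_id, track), where track is 'direct', 'structural', or 'miss'."""
    index = build_index(library) if index is None else index
    terms = parse_terms(generated)

    # Track 1: direct mapping via exact sequence match.
    hit = index.get(tuple(terms))
    if hit is not None:
        return hit, "direct"

    # Track 2: structural mapping via term overlap (illustrative score).
    best_id, best_score = None, 0
    generated_set = set(terms)
    for item_id, item_terms in library.items():
        score = len(generated_set & {t.lower() for t in item_terms})
        if score > best_score:
            best_id, best_score = item_id, score
    return best_id, ("structural" if best_id is not None else "miss")

# Toy candidate library (hypothetical).
library = {
    "i1": ("smartphone", "apple", "iphone-15", "pro", "256gb"),
    "i2": ("smartphone", "apple", "iphone-15", "standard", "128gb"),
    "i3": ("yoga-mat", "non-slip", "6mm", "fitness", "purple"),
}

print(ground("smartphone, apple, iphone-15, pro, 256gb", library))
print(ground("smartphone, apple, iphone-15, pro, 512gb", library))
print(ground("laptop, dell", library))
```

The fallback track is what makes the grounding "elastic": a generation that gets one keyword wrong can still resolve to the closest real item instead of being discarded.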

Novel Architectural Elements

Replacement of numerical/quantized SIDs with 'Term IDs' (TIDs) that exist within the LLM's native vocabulary
Dual-track grounding mechanism (Elastic Identifier Grounding) combining exact string matching with decompositional term scoring

Modeling

Base Model: Qwen3-4B-Instruct (version Qwen3-4B-Instruct-2507)

Training Method: Full fine-tuning (Integrative Instruction Fine-tuning)

Objective Functions:

Purpose: Minimize prediction error for next token in sequence.

Formally: Standard negative log-likelihood (NLL) loss on output tokens conditioned on input instruction.
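
Written out, a standard NLL objective of this kind takes the form below. The symbols are assumptions introduced for illustration: x is the input instruction, y_1, ..., y_T are the output tokens, and θ denotes the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid x,\, y_{<t}\right)
```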

Adaptation: Full fine-tuning

Key Hyperparameters:

max_generation_length: 30
TID_length: 5
beam_size: Not explicitly reported in the paper (standard beam search appears to be assumed)

Compute: Not reported in the paper

Comparison to Prior Work

vs. TIGER: GRLM uses native language tokens (Term IDs) instead of quantized codes, enabling better semantic transfer
vs. OneRec-Think: GRLM avoids vocabulary expansion and alignment training by using existing vocabulary
vs. TallRec: GRLM uses standardized, compact Term IDs rather than raw titles, reducing length and ambiguity
vs. P5 [not cited in paper]: P5 uses raw text templates for various tasks; GRLM structures the identifier specifically as a set of keywords to enable structural grounding.

Limitations

Depends on the quality of the initial metadata to generate meaningful Term IDs
Inference efficiency might be lower than pure ID-based methods due to generating multiple tokens (keywords) per item
Requires a powerful base LLM for the offline Context-aware Term Generation step

Reproducibility

Code: https://github.com/ZY0025/GRLM

The paper specifies the exact base model version (Qwen3-4B-Instruct-2507) and the embedding model (Qwen3-Embedding-8B). The datasets are the standard Amazon review datasets.

📊 Experiments & Results

Evaluation Setup

Sequential Recommendation (In-domain) and Cross-Domain Recommendation

Benchmarks:

Amazon Beauty (Sequential Recommendation)
Amazon Sports (Sequential Recommendation)
Amazon Toys (Sequential Recommendation)
Sports-Clothing (Cross-Domain Recommendation)
Phones-Electronics (Cross-Domain Recommendation)

Metrics:

Recall@5
Recall@10
NDCG@5
NDCG@10
Valid Rate (VR@K)
Direct Hit Rate (DHR@K)

Statistical methodology: Not explicitly reported in the paper
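
For reference, Recall@K and NDCG@K in the usual next-item setting, where each test case has exactly one ground-truth item, can be computed as in the sketch below. This is the common textbook formulation rather than the paper's evaluation code, and the function names and toy rankings are assumptions.

```python
import math

def recall_at_k(ranked, target, k):
    """1.0 if the ground-truth item appears in the top-k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def mean_metric(metric, predictions, targets, k):
    """Average a per-user metric over all test cases."""
    scores = [metric(ranked, target, k) for ranked, target in zip(predictions, targets)]
    return sum(scores) / len(scores)

# Toy rankings for three users (hypothetical).
predictions = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
targets = ["b", "d", "z"]

print(mean_metric(recall_at_k, predictions, targets, k=5))
print(mean_metric(ndcg_at_k, predictions, targets, k=5))
```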

Key Results

In-domain performance comparisons showing GRLM consistently outperforming baselines, particularly on the Sports dataset.

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
Amazon Beauty	Recall@5	0.0638	0.0688	+0.0050
Amazon Sports	Recall@5	0.0384	0.0500	+0.0116
Amazon Toys	Recall@5	0.0543	0.0624	+0.0081

Cross-domain experiments demonstrating GRLM's superior transfer capability without specialized alignment modules.

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
Sports-Clothing	Recall@10	0.0468	0.0768	+0.0300
Phones-Electronics	Recall@10	0.0927	0.1415	+0.0488

Ablation studies validating the necessity of Context-aware Term Generation (CTG) and Generative Term Internalization (GTI).

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
Amazon Beauty	Recall@10	0.0792	0.0898	+0.0106
Amazon Beauty	Recall@10	0.0818	0.0898	+0.0080
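
The Δ column reports absolute differences, while the headline figures in this summary are relative gains. The short sketch below converts the in-domain and cross-domain rows between the two; the Sports row works out to roughly 30.2%, matching the figure quoted under Evaluation Highlights. The variable and function names are illustrative.

```python
def relative_gain(baseline, ours):
    """Relative improvement over the baseline, in percent."""
    return (ours - baseline) / baseline * 100.0

# (benchmark, metric, baseline, this paper), copied from the results table.
rows = [
    ("Amazon Beauty", "Recall@5", 0.0638, 0.0688),
    ("Amazon Sports", "Recall@5", 0.0384, 0.0500),
    ("Amazon Toys", "Recall@5", 0.0543, 0.0624),
    ("Sports-Clothing", "Recall@10", 0.0468, 0.0768),
    ("Phones-Electronics", "Recall@10", 0.0927, 0.1415),
]

for benchmark, metric, baseline, ours in rows:
    gain = relative_gain(baseline, ours)
    print(f"{benchmark:20s} {metric:10s} {ours - baseline:+.4f} absolute, {gain:+.1f}% relative")
```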

Experiment Figures

Recall@10 performance scaling with model parameter size (0.6B to 14B) on three datasets.

Main Takeaways

GRLM outperforms SOTA baselines (both SID-based and text-based) on all reported datasets, with a relative Recall@5 gain of over 30% on Sports.
Cross-domain Recall@10 rises from 0.0468 to 0.0768 on Sports-Clothing and from 0.0927 to 0.1415 on Phones-Electronics, supporting the claim that natural language Term IDs act as a 'semantic bridge' for knowledge transfer.
The method scales with model size: Recall@10 improves steadily as the backbone LLM grows from 0.6B to 14B parameters, consistent with a scaling-law trend.

📚 Prerequisite Knowledge

Prerequisites

Generative Recommendation (predicting next item identifier autoregressively)
Semantic IDs (SIDs) vs. Textual IDs in Recommender Systems
Instruction Fine-tuning of LLMs
Beam Search decoding

Key Terms

Term IDs (TIDs): Structured item identifiers, each consisting of a set of semantically rich, standardized textual keywords derived from the LLM's native vocabulary

Semantic IDs (SIDs): Discrete codes generated by quantizing item embeddings (e.g., via RQ-VAE), used in prior work like TIGER to represent items

Context-aware Term Generation (CTG): A process to generate TIDs by prompting an LLM with both the target item's metadata and the metadata of its nearest neighbors to ensure consistency and discriminability

Integrative Instruction Fine-tuning (IIFT): A multi-task training paradigm enabling the LLM to learn both 'Generative Term Internalization' (mapping metadata to TIDs) and 'User Behavior Sequence Prediction'

Elastic Identifier Grounding (EIG): A retrieval mechanism during inference that attempts exact string matching first, then falls back to a structural score based on term overlap to map generated tokens to real items

Direct Hit Rate (DHR): The proportion of successful retrievals handled by the Direct Mapping track within EIG (exact matches)

Valid Rate (VR): The proportion of generated identifiers that validly belong to the candidate library
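
Under one reading of the two definitions above, both rates can be derived from per-generation grounding outcomes, as sketched below. The outcome labels ('direct', 'structural', 'miss') and the choice of denominators are assumptions made for illustration; the paper's exact definitions, including how the @K cutoff is applied, should be checked against the original text.

```python
def valid_rate(tracks):
    """Share of generated identifiers that resolve to some item in the library."""
    return sum(t != "miss" for t in tracks) / len(tracks)

def direct_hit_rate(tracks):
    """Share of successful retrievals that were resolved by exact (direct) matching."""
    valid = [t for t in tracks if t != "miss"]
    if not valid:
        return 0.0
    return sum(t == "direct" for t in valid) / len(valid)

# Hypothetical grounding outcomes for four generated identifiers.
tracks = ["direct", "direct", "structural", "miss"]

print(valid_rate(tracks))
print(direct_hit_rate(tracks))
```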