← Back to Paper List

A Bi-Step Grounding Paradigm for Large Language Models in Recommendation Systems

Keqin Bao, Jizhi Zhang, Wenjie Wang, Yang Zhang, Zhengyi Yang, Yancheng Luo, Chong Chen, Fuli Feng, Qi Tian
University of Science and Technology of China, National University of Singapore, Huawei Inc.
arXiv (2023)
Recommendation P13N

📝 Paper Summary

LLM for Recommendation (LLM4Rec) Generative Recommendation
BIGRec enables LLMs to perform global item ranking by first generating meaningful item descriptions (grounding to recommendation space) and then mapping these descriptions to real items using similarity and statistical priors.
Core Problem
Existing LLM-based recommendation methods often evaluate on limited candidate sets (e.g., negative sampling) rather than the full item space, failing to reflect true global ranking capabilities.
Why it matters:
  • Restricted evaluation (like negative sampling) is a poor indicator of true recommender system performance compared to all-rank settings
  • Directly generating item names with LLMs may produce hallucinations (items that don't exist) or fail to map to specific catalog IDs
  • LLMs struggle to incorporate statistical signals like item popularity and collaborative filtering purely through in-context learning or basic fine-tuning
Concrete Example: An LLM might recommend 'Iron Man (Sichuan dialect)'—a creative but non-existent movie. Standard systems fail to map this hallucinations to a valid catalog item, while BIGRec grounds this generation to the closest real item, 'Iron Man'.
Key Novelty
Bi-Step Grounding Paradigm (BIGRec)
  • Step 1: Ground LLM to 'Recommendation Space' by fine-tuning it to generate valid, meaningful item descriptions (tokens) based on user history.
  • Step 2: Ground generated descriptions to 'Actual Item Space' by calculating similarity between the LLM's output embedding and real item embeddings, weighted by statistical priors like popularity.
Architecture
Architecture Figure Figure 1
The BIGRec framework flow: Language Space → Recommendation Space → Actual Item Space.
Evaluation Highlights
  • Outperforms traditional baselines (e.g., SASRec) and LLM-based methods (e.g., TALLRec) in few-shot and multi-domain settings.
  • Achieves superior performance with significantly less training data; outperforms traditional models trained with 100x or 1000x more samples in low-resource scenarios.
  • Demonstrates that scaling training data yields diminishing returns for LLMs compared to ID-based models, suggesting LLMs rely more on semantic priors than statistical patterns.
Breakthrough Assessment
8/10
Addresses the critical 'hallucination vs. retrieval' gap in generative recommendation. The finding that LLMs hit a data-scaling plateau compared to ID-based models is a significant insight for the field.
×