Large Language Models are Competitive Near Cold-start Recommenders for Language- and Item-based Preferences

Scott Sanner, Krisztian Balog, Filip Radlinski, Ben Wedin, Lucas Dixon
University of Toronto, Google
arXiv (2023)
Recommendation P13N Benchmark

📝 Paper Summary

Cold-start recommendation · Conversational recommendation · Natural Language User Profiles
Large Language Models using natural language preference descriptions perform competitively with state-of-the-art item-based collaborative filtering for near cold-start recommendation, without needing supervised training.
Core Problem
Traditional recommender systems rely on extensive item rating history (collaborative filtering). This approach fails in cold-start scenarios where users have few ratings, and its learned representations offer little transparency.
Why it matters:
  • Cold-start is a pervasive problem: new users often abandon platforms before generating enough data for good recommendations
  • Item-based embeddings are inscrutable: users cannot understand or edit the internal vector representation of their preferences
  • Conversational interfaces are growing, but controlled comparisons of natural language preferences versus traditional item ratings are missing
Concrete Example: A user says 'I like comedy movies because I feel happy' but has no rating history. A traditional Matrix Factorization model cannot recommend anything personalized. An LLM can interpret this text, but it is unclear if it performs as well as if the user had just rated 5 specific comedy movies.
Key Novelty
Unified LLM Prompting for Language & Item Preferences
  • Collects a parallel dataset where users provide BOTH natural language descriptions of tastes ('I like sci-fi...') AND 5 specific item ratings, allowing direct comparison
  • Treats recommendation as a conditional generation task where the LLM scores candidate items based on a prompt containing user text descriptions, liked items, or both
  • Demonstrates that natural language descriptions alone (zero-shot) are sufficient for LLMs to match the performance of collaborative filtering trained on item ratings
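The prompt-then-score idea in the bullets above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names `build_prompt` and `rank_candidates`, and the scorer `toy_score` are all hypothetical. In particular, `toy_score` is a word-overlap stand-in for a real LLM call, used only so that the example runs without a model.

```python
from typing import Callable, List, Optional, Sequence


def build_prompt(candidate: str,
                 description: Optional[str] = None,
                 liked_items: Sequence[str] = ()) -> str:
    """Assemble a prompt from a text description, liked items, or both.

    The wording is an assumption made for illustration only.
    """
    parts = []
    if description:
        parts.append(f"The user describes their taste as: {description}")
    if liked_items:
        parts.append("The user liked: " + ", ".join(liked_items))
    parts.append(f"Would the user enjoy the movie '{candidate}'?")
    return "\n".join(parts)


def rank_candidates(candidates: Sequence[str],
                    score_fn: Callable[[str], float],
                    description: Optional[str] = None,
                    liked_items: Sequence[str] = ()) -> List[str]:
    """Score every candidate with score_fn and return them best-first."""
    scored = [(score_fn(build_prompt(c, description, liked_items)), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in for an LLM score: counts title words that
    also appear in the preference context. A real system would instead
    query a language model with the prompt."""
    *context, question = prompt.lower().split("\n")
    context_words = set(" ".join(context).replace(",", " ").split())
    title = question.split("'")[1]  # text between the first pair of quotes
    return float(sum(word in context_words for word in title.split()))


ranking = rank_candidates(
    ["Space War", "Comedy Night"],
    toy_score,
    description="I like comedy movies because I feel happy",
)
print(ranking)
```

Passing only `description`, only `liked_items`, or both corresponds to the language-only, item-only, and combined settings described above. Swapping `toy_score` for a function that calls an actual model leaves the rest of the sketch unchanged.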
Evaluation Highlights
  • LLM with few-shot (3 examples) prompting achieves 0.572 NDCG@10 on unseen items, statistically tying with the strong BPR-SLIM baseline (0.577)
  • LLM using ONLY language descriptions (zero-shot) achieves 0.563 NDCG@10 on unseen items, which is within error margins of standard Matrix Factorization (WRMF) at 0.573 and Item-kNN at 0.565
  • Language-based preferences were collected 3-4x faster (approx. 1 minute) than item-based preferences, suggesting higher efficiency for user elicitation
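For readers unfamiliar with the metric reported above, NDCG@10 can be computed roughly as below. This is a sketch under the common convention of linear gain with a log2 rank discount. The summary does not state which variant the paper uses, so that convention and the function names are assumptions.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions.

    Position 0 is the top-ranked item; its discount is log2(2) = 1.
    """
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering.

    `relevances` lists the true relevance of each item in ranked order and
    is assumed to cover all relevant candidates. Returns 0.0 when no item
    is relevant.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([1, 1, 0, 0])   # relevant items ranked first
swapped = ndcg_at_k([0, 1])         # the one relevant item is ranked second
print(perfect, swapped)
```

A score of 1.0 means the relevant items occupy the top positions. Placing a relevant item lower reduces the score through the logarithmic discount.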
Breakthrough Assessment
7/10
Provides crucial empirical evidence that LLMs prompted with interpretable text can be competitive with complex collaborative filtering in near cold-start settings, though the study is small (153 users).