Large Language Models are Competitive Near Cold-start Recommenders for Language- and Item-based Preferences

📝 Paper Summary

Cold-start recommendation, Conversational recommendation, Natural Language User Profiles

Large Language Models using natural language preference descriptions perform competitively with state-of-the-art item-based collaborative filtering for near cold-start recommendation, without needing supervised training.

Core Problem

Traditional recommender systems rely on extensive item rating history (collaborative filtering), which fails in cold-start scenarios where users have few ratings, and lacks transparency.

Why it matters:

Cold-start is a pervasive problem: new users often abandon platforms before generating enough data for good recommendations
Item-based embeddings are inscrutable: users cannot understand or edit the internal vector representation of their preferences
Conversational interfaces are growing, but controlled comparisons of natural language preferences versus traditional item ratings are missing

Concrete Example: A user says 'I like comedy movies because I feel happy' but has no rating history. A traditional Matrix Factorization model cannot recommend anything personalized. An LLM can interpret this text, but it is unclear whether the resulting recommendations are as good as those it would produce had the user simply rated 5 specific comedy movies.

Key Novelty

Unified LLM Prompting for Language & Item Preferences

Collects a parallel dataset where users provide BOTH natural language descriptions of tastes ('I like sci-fi...') AND 5 specific item ratings, allowing direct comparison
Treats recommendation as a conditional generation task where the LLM scores candidate items based on a prompt containing user text descriptions, liked items, or both
Demonstrates that natural language descriptions alone (zero-shot) are sufficient for LLMs to match the performance of collaborative filtering trained on item ratings

Evaluation Highlights

LLM with few-shot (3 examples) prompting achieves 0.572 NDCG@10 on unseen items, statistically tying with strong BPR-SLIM baseline (0.577)
LLM using ONLY language descriptions (Zero-shot) achieves 0.563 NDCG@10 on unseen items, statistically comparable (within error margins) to standard Matrix Factorization (WRMF) at 0.573 and Item-kNN at 0.565
Language-based preferences were collected 3-4x faster (approx. 1 minute) than item-based preferences, suggesting higher efficiency for user elicitation

Breakthrough Assessment

7/10

Provides crucial empirical evidence that LLMs using interpretable text can rival complex collaborative filtering in near cold-start settings, though the scale (153 users) is small.

⚙️ Technical Details

Problem Definition

Setting: Near cold-start movie recommendation ranking a candidate pool of 40 items

Inputs: User preference P (set of 5 liked items) AND/OR natural language description D ('I like...'), plus a candidate item i

Outputs: Ranked list of items based on log-likelihood scores from the LLM

Pipeline Flow

User Input (Description and/or 5 Items)
Prompt Construction (incorporating input + candidate item)
LLM Scoring (calculate log-likelihood of candidate item)
Ranking (sort candidate pool by score)
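
The four steps above can be sketched end to end as follows. This is a minimal illustration, not the paper's implementation: `build_prompt`, `rank_candidates`, and the prompt wording are hypothetical, and `toy_score` is a word-overlap stand-in for the LLM log-likelihood call, used only so the example runs without a model. The movie titles are made up.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def build_prompt(description: Optional[str], liked_items: Sequence[str]) -> str:
    """Assemble a prompt from whichever preference signals are available.

    The wording here is an assumption, not the paper's template.
    """
    parts = []
    if description:
        parts.append(f"User description: {description}")
    if liked_items:
        parts.append("Liked movies: " + ", ".join(liked_items))
    parts.append("Recommended movie:")
    return "\n".join(parts)


def rank_candidates(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every candidate against the prompt and sort best-first."""
    scored = [(item, score_fn(prompt, item)) for item in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(prompt: str, candidate: str) -> float:
    """Stand-in scorer: counts candidate words that also occur in the prompt.

    A real system would return the LLM's log-likelihood of the candidate here.
    """
    prompt_words = set(prompt.lower().split())
    return float(sum(word in prompt_words for word in candidate.lower().split()))


prompt = build_prompt("I like comedy movies because I feel happy", [])
ranked = rank_candidates(prompt, ["Space War", "Comedy Night", "Quiet Drama"], toy_score)
print(ranked[0][0])
```

Keeping the scorer behind a plain callable means the ranking step stays the same whether the prompt carries a description, liked items, or both.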

System Modules

Prompt Constructor

Formats user preferences into specific templates (Zero-shot, Few-shot, Completion)

Model or implementation: Template-based string formatter
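
A template-based formatter for the three prompt styles might look like the sketch below. The exact template wording used in the paper is not reproduced in this summary, so every string here is an assumption, as are the function names and the example titles.

```python
from typing import Sequence, Tuple

# (preference text, target movie) pair used as a few-shot demonstration.
Example = Tuple[str, str]


def zero_shot_prompt(preferences: str) -> str:
    """Instruction plus the user's preferences, with no worked examples."""
    return (
        "Recommend a movie for the user described below.\n"
        f"User preferences: {preferences}\n"
        "Recommended movie:"
    )


def few_shot_prompt(preferences: str, examples: Sequence[Example]) -> str:
    """Prepend solved examples, then leave the final answer slot open."""
    blocks = [f"User preferences: {p}\nRecommended movie: {m}" for p, m in examples]
    blocks.append(f"User preferences: {preferences}\nRecommended movie:")
    return "\n\n".join(blocks)


def completion_prompt(liked_items: Sequence[str]) -> str:
    """Plain text that the model is expected to continue with another title."""
    return "Movies I like: " + ", ".join(liked_items) + ","


demos = [
    ("I enjoy slow character studies", "Quiet Drama"),
    ("I like big space battles", "Space War"),
    ("I want something light and funny", "Comedy Night"),
]
few_shot = few_shot_prompt("I like comedy movies because I feel happy", demos)
print(few_shot)
```

In each case the candidate item name would be appended after the prompt and scored, so all three templates end at the position where a title is expected.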

Scoring Engine

Computes the probability of the candidate item name given the prompt context

Model or implementation: PaLM (62B parameters)
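
Scoring a candidate by the probability of its name given the prompt amounts to summing per-token log-probabilities under the model. The sketch below shows that accumulation; the `token_logprob` callable is a hypothetical interface standing in for a real model API, and `uniform_logprob` is a toy model included only so the code runs.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: log P(token | context) under some language model.
TokenLogProb = Callable[[Sequence[str], str], float]


def sequence_log_likelihood(
    prompt_tokens: Sequence[str],
    candidate_tokens: Sequence[str],
    token_logprob: TokenLogProb,
) -> float:
    """Sum log P(token | prompt + earlier candidate tokens) over the candidate."""
    context = list(prompt_tokens)
    total = 0.0
    for token in candidate_tokens:
        total += token_logprob(context, token)
        context.append(token)  # each later token is conditioned on this one
    return total


def uniform_logprob(context: Sequence[str], token: str, vocab_size: int = 1000) -> float:
    """Toy model that assigns every token the same probability."""
    return -math.log(vocab_size)


score = sequence_log_likelihood(
    ["Recommended", "movie:"], ["Comedy", "Night"], uniform_logprob
)
print(score)
```

A raw sum favours shorter item names, since each extra token can only lower the score. Whether the paper applies any length normalization is not stated in this summary, so none is applied here.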

Novel Architectural Elements

Unified prompting framework that accepts both unstructured text descriptions and structured item lists as equal-status inputs for measuring compatibility

Modeling

Base Model: PaLM (62 billion parameters)

Training Method: Inference-only prompting (Zero-shot and Few-shot)

Compute: Not reported in the paper

Comparison to Prior Work

vs. P5: Uses a general-purpose pretrained LLM (PaLM) without any fine-tuning, whereas P5 fine-tunes T5 on recommendation tasks
vs. EASE: EASE requires a training matrix of user-item interactions; this approach works zero-shot with just a text description
vs. BM25-Fusion: Outperforms traditional IR baselines that simply match description text to reviews

Limitations

Small sample size (153 raters) limits statistical power and generalizability
Restricted to the Movie domain; performance in other domains is untested
Relies on a very large proprietary model (PaLM 62B), raising cost/latency concerns compared to efficient baselines like EASE
Did not explore 'Chat' based interaction, only static prompt ranking

Reproducibility

The paper uses a proprietary model (PaLM 62B) and an internal dataset collected specifically for this study. While the prompts are described in detail, the specific dataset and model weights are not publicly available.

📊 Experiments & Results

Evaluation Setup

Re-ranking a fixed pool of 40 items (10 popular, 10 mid-popular, 10 personalized EASE, 10 personalized BM25) per user
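
One way to assemble such a 40-item pool is sketched below, taking up to 10 items from each of the four sources in turn. Skipping items already in the pool is an assumption made for this illustration; the summary does not say how overlaps between sources were handled. All names and item identifiers are hypothetical.

```python
from typing import List, Sequence, Set


def take_new(source: Sequence[str], already: Set[str], n: int) -> List[str]:
    """Return the first n items of `source` not already in the pool."""
    picked: List[str] = []
    for item in source:
        if item not in already and item not in picked:
            picked.append(item)
            if len(picked) == n:
                break
    return picked


def build_pool(sources: Sequence[Sequence[str]], per_source: int = 10) -> List[str]:
    """Concatenate up to `per_source` fresh items from each ranked source."""
    pool: List[str] = []
    for source in sources:
        pool.extend(take_new(source, set(pool), per_source))
    return pool


popular = [f"pop{i}" for i in range(15)]
mid_popular = [f"mid{i}" for i in range(15)]
ease_ranked = ["pop0", "pop1"] + [f"ease{i}" for i in range(15)]  # overlaps with popular
bm25_ranked = ["ease0"] + [f"bm{i}" for i in range(15)]  # overlaps with EASE

pool = build_pool([popular, mid_popular, ease_ranked, bm25_ranked])
print(len(pool))
```

Each method under evaluation then re-ranks this same fixed pool, so differences in NDCG reflect the ranking rather than the candidate retrieval.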

Benchmarks:

Custom Movie Dataset (Near cold-start recommendation ranking) [New]

Metrics:

NDCG@10 (Normalized Discounted Cumulative Gain)
Statistical methodology: 95% confidence intervals (based on standard error) reported for all mean NDCG values
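
For reference, NDCG@10 and a 95% interval around its mean can be computed as in the sketch below. It assumes linear gain (the relevance value itself, rather than 2^rel - 1) and a normal approximation of 1.96 standard errors; the summary does not specify which variants the paper used. The per-user values in the example are made up.

```python
import math
from typing import Sequence, Tuple


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG divided by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_with_ci95(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and 95% half-width via the normal approximation (needs n >= 2)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, 1.96 * math.sqrt(variance / n)


# Relevance of items in ranked order for one user: the relevant item is second.
print(ndcg_at_k([0, 1]))

# Made-up per-user NDCG values, for illustration only.
mean, half_width = mean_with_ci95([0.5, 0.6, 0.7])
print(mean, half_width)
```

Two methods whose intervals overlap substantially, as with the 0.572 and 0.577 figures above, are what the summary describes as a statistical tie.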

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on Unseen Items: This is the critical metric for recommendation utility, as recommending already-seen items is trivial. The LLM approaches perform competitively with supervised baselines.
Custom Movie Dataset (Unseen items)	NDCG@10	0.577	0.572	-0.005
Custom Movie Dataset (Unseen items)	NDCG@10	0.565	0.572	+0.007
Custom Movie Dataset (Unseen items)	NDCG@10	0.542	0.563	+0.021
Ablation on Modality: Comparing prompts using only Item lists vs. only Language descriptions vs. Both.
Custom Movie Dataset (Unseen items)	NDCG@10	0.571	0.563	-0.008
Custom Movie Dataset (Unseen items)	NDCG@10	0.571	0.582	+0.011

Main Takeaways

Zero-shot LLMs are competitive with supervised CF: A general-purpose LLM using just text descriptions matches the performance of specialized algorithms trained on rating matrices.
Language preferences are efficient: Users generated text descriptions 3-4x faster than selecting 5 specific items, yet achieved similar recommendation quality.
Few-shot prompting helps: Providing 3 examples (Few-shot) generally outperformed Zero-shot and simple Completion prompting.
Negative preferences don't help much: Explicitly including 'disliked' movies or descriptions did not yield meaningful improvements over positive-only prompts.
No synergy between modalities: Combining language and item preferences (Item+Language) did not significantly outperform the best single-modality prompts.

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering (CF) basics
Matrix Factorization
Prompt engineering (Zero-shot vs. Few-shot)
NDCG evaluation metric

Key Terms

Cold-start: The scenario where a recommender system has little to no data about a new user or item, making personalization difficult

Collaborative Filtering: A technique that recommends items based on the preferences of similar users (e.g., 'people who liked X also liked Y')

Zero-shot: Asking a model to perform a task (here, ranking movies) without providing any specific training examples in the prompt

Few-shot: Providing a small number of example inputs and outputs (e.g., 3 users with their preferences and a target movie) in the prompt to guide the model

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items in the list

EASE: Embarrassingly Shallow Autoencoders—a linear model for collaborative filtering that often performs as well as complex deep learning methods

BPR-SLIM: Bayesian Personalized Ranking Sparse Linear Method—a ranking optimization method that learns a sparse weight matrix for item similarities

BM25: A probabilistic information retrieval function that ranks documents based on the query terms appearing in each document

PaLM: A large language model developed by Google (Pathways Language Model), used as the backbone for the prompting experiments

Prompting: The process of structuring text input to an LLM to elicit a specific output or behavior