Zero-Shot Next-Item Recommendation using Large Pretrained Language Models

📝 Paper Summary

LLM-based Recommendation
Zero-Shot Learning

LLMs can perform effective zero-shot next-item recommendation by using a multi-step prompting strategy that infers user preferences and ranks items within a pre-filtered candidate set.

Core Problem

Directly using LLMs for recommendation fails because the item space is too large for the context window, and LLMs lack specific knowledge of a user's interaction history.

Why it matters:

Traditional recommender systems require extensive training data and cannot function in zero-shot scenarios (new domains/tasks)
Standard LLM prompting yields poor accuracy (e.g., HR@10 of 0.0297) due to hallucinations and inability to rank huge catalogs
Bridging general-purpose LLM reasoning with specific recommendation tasks is crucial for cold-start scenarios

Concrete Example: If a user has watched 'Toy Story' and 'Shrek', a simple prompt asking 'Recommend 10 movies' might return random popular movies or hallucinations not in the database. The proposed NIR (Next-Item Recommendation) approach first filters to 20 relevant candidates, asks the LLM to summarize that the user likes 'animated comedies', and then ranks the candidates based on that summary.

Key Novelty

Zero-Shot Next-Item Recommendation (NIR) Prompting

Uses an external heuristic (User Filtering or Item Filtering) to narrow the full item catalog down to a small 'Candidate Set' that fits in the prompt
Decomposes the recommendation task into three explicit LLM reasoning steps: (1) Summarize user taste, (2) Pick representative history, (3) Rank candidates
Enforces a strict output format (e.g., 'watched movie <- candidate movie') to map LLM text generation back to specific database item IDs
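
The three-step decomposition above can be sketched as a simple prompt chain. This is a minimal illustration, not the paper's code: the prompt wording, the function name `nir_prompt_chain`, and the `llm` callable (any function that maps a prompt string to a completion string) are assumptions introduced here.

```python
from typing import Callable, List


def nir_prompt_chain(
    watched: List[str],
    candidates: List[str],
    llm: Callable[[str], str],
    top_k: int = 10,
) -> str:
    """Run a three-step prompt chain; the wording of each prompt is hypothetical."""
    context = (
        f"Candidate Set (candidate movies): {', '.join(candidates)}.\n"
        f"The movies I have watched (watched movies): {', '.join(watched)}.\n"
    )

    # Step 1: summarize the user's taste from the watch history.
    step1 = context + "Step 1: What features matter most to me when selecting movies?"
    preferences = llm(step1)

    # Step 2: pick representative watched movies, conditioned on the step 1 answer.
    step2 = (
        step1 + "\nAnswer: " + preferences + "\n"
        "Step 2: Pick the most representative movies from my watched movies "
        "according to my preferences."
    )
    representatives = llm(step2)

    # Step 3: rank candidates, conditioned on both intermediate answers.
    step3 = (
        step2 + "\nAnswer: " + representatives + "\n"
        f"Step 3: Recommend {top_k} movies from the Candidate Set similar to the "
        "selected movies. Format: watched movie <- candidate movie"
    )
    return llm(step3)
```

Each later prompt carries the earlier answers forward, so the final ranking step sees both the preference summary and the representative movies.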

Architecture

The 3-Step Zero-Shot NIR Prompting Strategy workflow.

Evaluation Highlights

Outperforms the fully trained FPMC baseline by 0.0169 absolute HR@10 (0.1187 vs 0.1018, about 1.7 percentage points or roughly 17% relative) on MovieLens 100K without any training
Achieves 0.1187 HR@10, performing comparably to strong sequential baselines like GRU4Rec (0.1230) and SASRec (0.1241)
Improves over Simple Prompting by ~4x (0.1187 vs 0.0297 HR@10) by introducing candidate sets and multi-step reasoning
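
To keep absolute and relative improvements apart, the comparisons above can be recomputed from the reported HR@10 values. The helper name `compare` is an assumption introduced for this sketch; the scores are the ones quoted in this summary.

```python
from typing import Dict, Tuple


def compare(ours: float, baseline: float) -> Tuple[float, float, float]:
    """Return (absolute difference, relative difference, ratio) of two scores."""
    absolute = ours - baseline
    return absolute, absolute / baseline, ours / baseline


NIR_HR10 = 0.1187  # zero-shot NIR, as reported above
BASELINE_HR10: Dict[str, float] = {
    "Simple Prompting": 0.0297,
    "FPMC": 0.1018,
    "GRU4Rec": 0.1230,
    "SASRec": 0.1241,
}

for name, score in BASELINE_HR10.items():
    absolute, relative, ratio = compare(NIR_HR10, score)
    print(f"{name}: {absolute:+.4f} absolute, {relative:+.1%} relative, {ratio:.2f}x")
```

For FPMC the absolute gap is 0.0169, which is roughly 17% in relative terms; the two figures should not be read interchangeably.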

Breakthrough Assessment

7/10

Demonstrates that LLMs can compete with trained baselines in zero-shot settings if the search space is constrained. Novelty lies in the prompting structure rather than architecture.

⚙️ Technical Details

Problem Definition

Setting: Zero-Shot Next-Item Recommendation

Inputs: User's historical sequence of interacted items (movies) and a pre-computed Candidate Set

Outputs: A ranked list of 10 items from the Candidate Set

Pipeline Flow

Group: Pre-processing -> Candidate Set Construction (External)
Group: Inference -> 3-Step LLM Prompting -> Answer Extraction

System Modules

Candidate Set Constructor

Reduce item search space to a manageable size for the prompt context window

Model or implementation: Heuristic (User Filtering or Item Filtering)
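
As a rough illustration of how a User Filtering heuristic could build the candidate set, the sketch below ranks other users by Jaccard overlap with the target user and collects the unseen items that occur most often among the nearest neighbors. The similarity measure, the exclusion of already-seen items, the parameter defaults, and the function name `user_filtering_candidates` are all assumptions; this summary does not specify the exact procedure used in the paper.

```python
from collections import Counter
from typing import Dict, List, Set


def user_filtering_candidates(
    interactions: Dict[str, Set[str]],
    target_user: str,
    num_neighbors: int = 10,
    num_candidates: int = 19,
) -> List[str]:
    """Pick candidates popular among users most similar to the target user."""
    target_items = interactions[target_user]

    def jaccard(a: Set[str], b: Set[str]) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    # Rank other users by overlap with the target user's history.
    neighbors = sorted(
        (u for u in interactions if u != target_user),
        key=lambda u: jaccard(target_items, interactions[u]),
        reverse=True,
    )[:num_neighbors]

    # Count how often each unseen item occurs among the neighbors.
    votes = Counter(
        item
        for u in neighbors
        for item in interactions[u]
        if item not in target_items
    )
    return [item for item, _ in votes.most_common(num_candidates)]
```

An Item Filtering variant would instead score items by their similarity to the items in the target user's own history.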

GPT-3 Prompting, Steps 1-3 (Inference)

Reason about preferences and rank items

Model or implementation: text-davinci-003

Answer Extractor (Inference)

Parse LLM text output into structured item list

Model or implementation: Rule-based regex
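
A rule-based extractor for the 'watched movie <- candidate movie' format might look like the sketch below. The regular expression, the optional handling of list numbering, and the function name `extract_recommendations` are assumptions for illustration; the paper's actual parsing rules are not given in this summary.

```python
import re
from typing import List


def extract_recommendations(llm_output: str, candidates: List[str]) -> List[str]:
    """Return candidate titles named on the right-hand side of '<-' lines."""
    # Case-insensitive lookup from normalized title to the canonical title.
    lookup = {title.strip().lower(): title for title in candidates}
    # Optional list numbering, then "watched <- candidate".
    pattern = re.compile(r"^\s*(?:\d+[.)]\s*)?(.+?)\s*<-\s*(.+?)\s*$")

    ranked: List[str] = []
    for line in llm_output.splitlines():
        match = pattern.match(line)
        if not match:
            continue  # skip lines that do not follow the format
        title = lookup.get(match.group(2).lower())
        # Drop titles outside the candidate set and repeated titles.
        if title is not None and title not in ranked:
            ranked.append(title)
    return ranked
```

Matching against the candidate set discards hallucinated titles; the surviving titles can then be mapped to database item IDs with an ordinary title-to-ID lookup.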

Novel Architectural Elements

Places a heuristically filtered candidate set (UF/IF) directly in the prompt, so the LLM ranks a short list instead of searching the full item catalog
3-step prompt chain that feeds intermediate outputs (User Preference Summary, Representative Movies) into the final recommendation prompt

Modeling

Base Model: GPT-3 (text-davinci-003)

Compute: Inference only. No training performed. Uses OpenAI API.

Comparison to Prior Work

vs. BERT4Rec/SASRec: Zero-Shot NIR requires NO training, whereas sequential models require full training on interaction data
vs. Zhang et al. (GPT-2): Zero-Shot NIR uses a Candidate Set to constrain the space, whereas previous LLM approaches tried to predict from the full item universe
vs. P5 [not cited in paper]: P5 requires fine-tuning on recommendation tasks; Zero-Shot NIR is purely inference-based

Limitations

Reliance on an external candidate generation module (UF/IF) means it is not purely end-to-end LLM-based
Performance is sensitive to the size of the Candidate Set (optimal around 19 items)
Inference cost via GPT-3 API is significantly higher than lightweight trained models like SASRec
Evaluated only on a single dataset (MovieLens 100K) and domain (Movies)

Reproducibility

Code: https://github.com/AGI-Edgerunners/LLM-Next-Item-Rec

Code is publicly available on GitHub. Uses OpenAI's text-davinci-003 API. Dataset is standard MovieLens 100K. Candidate generation logic (UF/IF) is standard collaborative filtering.

📊 Experiments & Results

Evaluation Setup

Next-item recommendation on MovieLens 100K dataset

Benchmarks:

MovieLens 100K (Sequential Recommendation)

Metrics:

HR@10 (Hit Ratio)
NDCG@10 (Normalized Discounted Cumulative Gain)

Statistical methodology: Not explicitly reported in the paper

Key Results

Zero-shot performance comparisons demonstrate that the proposed method beats simple baselines and some trained models, while approaching SOTA sequential models.

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens 100K	HR@10	0.0519	0.1187	+0.0668
MovieLens 100K	HR@10	0.0297 (Simple Prompting)	0.1187	+0.0890
MovieLens 100K	HR@10	0.1018 (FPMC)	0.1187	+0.0169
MovieLens 100K	HR@10	0.1241 (SASRec)	0.1187	-0.0054

Experiment Figures

Impact of Candidate Set Size on HR@10 performance.

Main Takeaways

User Filtering (UF) consistently generates better candidate sets than Item Filtering (IF) for the LLM to rank
Multi-step prompting (separating preference summary, representative selection, and ranking) outperforms single-step prompting with the same candidate set
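Performance is highly sensitive to Candidate Set size: HR@10 peaks at around 19 items and degrades at smaller (15) or larger (22) sizes, possibly because a small set is less likely to contain the ground-truth item while a large set strains the context and makes ranking harder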
LLMs can act as effective rankers in zero-shot settings when the search space is pre-filtered by traditional heuristics

📚 Prerequisite Knowledge

Prerequisites

Understanding of Prompt Engineering (Chain-of-Thought)
Basic Recommender Systems concepts (Collaborative Filtering)
Sequential Recommendation metrics (HR, NDCG)

Key Terms

NIR: Next-Item Recommendation, i.e., predicting the most likely item a user will interact with next based on their history

Zero-Shot: Performing the task without training the model on domain-specific examples (using a pre-trained LLM 'as is')

Candidate Set: A small subset of items (e.g., 20 movies) pre-selected by a heuristic algorithm to reduce the search space for the LLM

User Filtering (UF): A heuristic that selects candidate items based on what similar users have liked (collaborative filtering)

Item Filtering (IF): A heuristic that selects candidate items similar to those the target user has already interacted with

HR@10: Hit Ratio at 10, the fraction of test cases in which the ground-truth item appears in the top 10 recommendations (reported here on a 0-1 scale, e.g., 0.1187)

NDCG@10: Normalized Discounted Cumulative Gain at 10, a ranking metric that gives higher scores when the correct item appears higher in the list
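
The two metrics can be computed as follows for the usual next-item setup with a single ground-truth item per test case. This is a minimal sketch under that assumption; the function names `hit_and_ndcg_at_k` and `evaluate` are introduced here and are not taken from the paper's code.

```python
import math
from typing import List, Sequence, Tuple


def hit_and_ndcg_at_k(ranked: Sequence[str], target: str, k: int = 10) -> Tuple[float, float]:
    """Per-case hit (0 or 1) and NDCG for a single ground-truth item."""
    top_k = list(ranked[:k])
    if target not in top_k:
        return 0.0, 0.0
    rank = top_k.index(target) + 1  # 1-based position in the list
    # With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
    return 1.0, 1.0 / math.log2(rank + 1)


def evaluate(results: List[Tuple[Sequence[str], str]], k: int = 10) -> Tuple[float, float]:
    """Average HR@k and NDCG@k over (ranked_list, target) pairs."""
    scores = [hit_and_ndcg_at_k(ranked, target, k) for ranked, target in results]
    n = len(scores)
    return sum(h for h, _ in scores) / n, sum(g for _, g in scores) / n
```

A hit at rank 1 contributes an NDCG of 1, while a hit at rank 10 contributes 1 / log2(11), so NDCG@10 rewards placing the correct item near the top and HR@10 only checks that it is present.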