Large Language Models for Intent-Driven Session Recommendations

📝 Paper Summary

Session-based Recommendation (SR), Intent-aware Recommendation, LLM Prompt Optimization

PO4ISR automates prompt optimization for session-based recommendation by using LLMs to self-reflect on error cases, refine prompts iteratively, and transfer the best prompts across domains.

Core Problem

Traditional intent-aware session recommendation methods assume a fixed number of latent intents and lack transparency, while existing LLM approaches rely on static prompts that fail to capture dynamic user intents.

Why it matters:

Real-world sessions have varying, dynamic intents (e.g., buying a laptop vs. buying multiple unrelated gifts) that fixed-intent models miss
Latent embedding spaces in traditional models are opaque, making recommendations hard to explain or interpret
Manually designing optimal prompts for LLMs is tedious and often suboptimal compared to automated refinement

Concrete Example: A user buys a laptop, then a camera. A standard model might assume a single 'electronics' intent. PO4ISR's prompt explicitly instructs the LLM to identify multiple intents (laptop accessories vs. camera accessories) and rank the next item (a camera lens) higher than a laptop bag.

Key Novelty

Prompt Optimization for Intent-aware Session Recommendation (PO4ISR)

Iteratively optimizes prompts by asking the LLM to analyze its own recommendation errors, infer reasons for failure, and rewrite the prompt to address those specific weaknesses
Evaluates prompt candidates using a UCB (Upper Confidence Bound) bandit algorithm to efficiently identify high-performing prompts without testing every prompt on the full dataset
Leverages cross-domain transfer by selecting the best-performing prompt from a source domain (e.g., Games) to use in target domains, exploiting LLM generalizability

Architecture

The overall PO4ISR pipeline including initialization, the optimization loop (error collection -> reasoning -> refinement -> evaluation), and selection.

Evaluation Highlights

+57.37% average improvement in HR@5 and +61.03% in NDCG@5 over state-of-the-art baselines across three datasets
Outperforms standard zero-shot prompting (NIR) by ~121% on the Games dataset (NDCG@1), which supports the value of iterative optimization over static prompts
Cross-domain prompt selection works effectively: The prompt optimized on the Games dataset achieved the best performance even when applied to the Movie and Bundle domains

Breakthrough Assessment

7/10

Significant performance jumps over non-LLM baselines and static prompting. The application of automatic prompt optimization (APO) to recommendation is novel and practical, though the core mechanism borrows from NLP optimization techniques.

⚙️ Technical Details

Problem Definition

Setting: Session-based recommendation (SR) predicting the next item in a sequence

Inputs: Anonymous behavior session S = [i_1, i_2, ..., i_m] and a candidate item set C

Outputs: Ranked list of items from C based on probability of being the next interaction i_{m+1}
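As a minimal sketch of this input/output contract, the task can be typed as below. The names `SessionTask` and `rank_candidates`, and the per-item `score` callable, are illustrative assumptions and not taken from the paper's code. In PO4ISR the ranking comes from an LLM response rather than a per-item scorer, so the scorer here is only a stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SessionTask:
    """One next-item prediction instance: session S and candidate set C."""
    session: List[str]     # S = [i_1, ..., i_m], in interaction order
    candidates: List[str]  # C, the items to be ranked


def rank_candidates(
    task: SessionTask,
    score: Callable[[List[str], str], float],  # hypothetical stand-in for the LLM
) -> List[str]:
    """Return C sorted by descending score; ties keep their original order."""
    return sorted(
        task.candidates,
        key=lambda item: score(task.session, item),
        reverse=True,
    )
```

Any model that maps (S, C) to an ordering of C fits this signature, which is what lets LLM-based and embedding-based methods be compared on the same candidate sets.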

Pipeline Flow

PromptInit (Initialize task description)
PromptOpt (Iteratively refine prompts via self-reflection)
PromptSel (Select best prompt across domains)

System Modules

PromptInit

Generate initial task description instructing LLM to predict next item based on session intents

Model or implementation: GPT-3.5-turbo

PromptOpt (Reasoning & Refinement) (Optimization)

Analyze error cases and generate refined prompts

Model or implementation: GPT-3.5-turbo
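One self-reflection step might look like the sketch below. It assumes a hypothetical text-in/text-out `llm` callable, and the meta-prompt wording is invented for illustration; the paper's actual templates differ and are described there. The default `n_reasons=2` mirrors the N_r value listed under Key Hyperparameters.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical client: prompt text in, reply text out


def refine_prompt(
    prompt: str,
    errors: List[Tuple[str, str, str]],  # (session text, predicted item, actual item)
    llm: LLM,
    n_reasons: int = 2,
) -> List[str]:
    """Ask for failure reasons, then request one rewritten prompt per reason."""
    cases = "\n".join(
        f"- session: {s} | predicted: {p} | actual: {t}" for s, p, t in errors
    )
    # Step 1 (reasoning): illustrative wording, not the paper's template.
    reasons_text = llm(
        "The prompt below produced wrong recommendations.\n"
        f"Prompt: {prompt}\nError cases:\n{cases}\n"
        f"Give {n_reasons} reasons, one per line."
    )
    reasons = [r.strip() for r in reasons_text.splitlines() if r.strip()][:n_reasons]
    # Step 2 (refinement): one candidate prompt per inferred reason.
    return [
        llm(f"Rewrite this prompt to fix the issue.\nPrompt: {prompt}\nIssue: {r}")
        for r in reasons
    ]
```

Passing the LLM in as a callable keeps the step testable with a stub and independent of any particular API client.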

PromptOpt (Evaluation) (Optimization)

Efficiently estimate performance of new prompts

Model or implementation: UCB Bandit Algorithm
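The evaluation idea can be sketched with the standard UCB1 rule, treating each candidate prompt as an arm and one sampled batch as one pull. This is a swapped-in textbook variant: the exact rule and the exploration constant `c` used by the paper are not given in this summary, and `sample_score` is a hypothetical callable returning a prompt's mean metric on one random batch. The defaults for `rounds` and `keep` mirror E1 and N_o from Key Hyperparameters.

```python
import math
from typing import Callable, List


def ucb_top_prompts(
    prompts: List[str],
    sample_score: Callable[[str], float],  # hypothetical: metric on one random batch
    rounds: int = 16,
    keep: int = 4,
    c: float = 1.0,  # exploration constant (assumed value)
) -> List[str]:
    """Spend `rounds` batch evaluations, then return the `keep` best prompts by mean."""
    if not prompts:
        return []
    counts = [0] * len(prompts)
    means = [0.0] * len(prompts)

    def upper_bound(j: int, t: int) -> float:
        return means[j] + c * math.sqrt(math.log(t) / counts[j])

    for t in range(1, rounds + 1):
        # Pull every arm once before applying the confidence-bound rule.
        untried = [j for j, n in enumerate(counts) if n == 0]
        if untried:
            arm = untried[0]
        else:
            arm = max(range(len(prompts)), key=lambda j: upper_bound(j, t))
        reward = sample_score(prompts[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean

    ranked = sorted(range(len(prompts)), key=lambda j: means[j], reverse=True)
    return [prompts[j] for j in ranked[:keep]]
```

Prompts that look weak after a few batches stop receiving pulls, which is how the budget stays well below a full-dataset evaluation of every candidate. In this sketch, if `rounds` is smaller than the number of prompts, the unpulled prompts keep a mean of 0.0.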

PromptSel

Select the final prompt for inference

Model or implementation: Selection Logic
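A minimal sketch of such selection logic follows, assuming a score table in which `scores[source][target]` holds the validation metric of the source domain's top-1 prompt when applied to the target domain. Aggregating by the mean across targets is an assumption made for illustration; the summary does not state the paper's exact criterion.

```python
from typing import Dict


def select_source_domain(scores: Dict[str, Dict[str, float]]) -> str:
    """Return the source domain whose top-1 prompt has the best mean score over targets."""

    def mean_score(source: str) -> float:
        row = scores[source]
        return sum(row.values()) / len(row)

    return max(scores, key=mean_score)
```

The prompt optimized on the returned domain would then be used for inference on every domain.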

Novel Architectural Elements

Iterative self-reflective prompt optimization loop specifically for recommendation (Error -> Reason -> Refine -> Augment)
Cross-domain prompt selection mechanism leveraging LLM generalizability to use a single domain's optimized prompt for all tasks

Modeling

Base Model: GPT-3.5-turbo (ChatGPT)

Training Method: Prompt Optimization (In-Context Learning / Zero-Shot)

Adaptation: None (uses API-based prompting)

Trainable Parameters: None (optimization operates on discrete prompt text, not on model weights)

Training Data:

Uses subsets of training data for prompt optimization: 50 sessions for PO4ISR vs 150 for baselines

Key Hyperparameters:

N_t (batch size for evaluation): 32
N_r (reasons per error): 2
N_o (top prompts kept): 4
E1 (UCB max epochs): 16
E2 (optimization iterations): 2
candidate_size: 20
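For reference, the listed values can be gathered into a single configuration object. The field names below are illustrative and not taken from the released code; the comment on `candidate_size` reflects a presumed reading (the number of items in the candidate set C).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PO4ISRConfig:
    batch_size: int = 32        # N_t: sessions per evaluation batch
    reasons_per_error: int = 2  # N_r
    top_prompts_kept: int = 4   # N_o
    ucb_max_epochs: int = 16    # E1
    opt_iterations: int = 2     # E2
    candidate_size: int = 20    # presumably the number of items in C
```

Freezing the dataclass keeps a run's settings immutable, which makes it easier to log them alongside results.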

Compute: Not reported in the paper

Comparison to Prior Work

vs. NIR: Adds iterative optimization and self-reflection loops instead of static prompting
vs. MCPRN/GCE-GNN: Uses semantic reasoning of LLMs rather than latent embeddings; does not require training on large datasets
vs. Automatic Prompt Optimization (APO) [not cited in paper]: Adapts general NLP prompt optimization (like APO/ProTeGi) specifically to the ranking/recommendation domain

Limitations

Hallucination persists: ~5.5% of sessions have invalid outputs even with hard constraints
Reliance on commercial APIs (GPT-3.5) creates cost and reproducibility dependency
Inference latency is high due to LLM generation (not explicitly measured but inherent to method)
Performance gain depends on the quality of the initial prompt (better initial prompts yield smaller relative gains)
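Invalid outputs of the kind described in the first limitation can be detected with a simple post-check. The sketch below assumes the LLM is asked to return one item per line, which is an assumed output format, and treats a reply as valid only if it is an exact permutation of the candidate set. That is one possible strictness level, not necessarily the paper's criterion.

```python
from typing import List, Optional


def parse_ranking(raw: str, candidates: List[str]) -> Optional[List[str]]:
    """Return the ranked items if the reply is a permutation of the candidates, else None."""
    items = [line.strip() for line in raw.splitlines() if line.strip()]
    # Reject replies that invent items, drop items, or repeat them.
    if sorted(items) != sorted(candidates):
        return None
    return items
```

A caller could retry on `None` or count such sessions as failures when reporting metrics.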

Reproducibility

Code: https://github.com/llm4sr/PO4ISR

Code and data are available in the repository. Experiments use the OpenAI API (GPT-3.5-turbo), and the prompts are fully described in the paper.

📊 Experiments & Results

Evaluation Setup

Next-item prediction on session datasets

Benchmarks:

MovieLens-1M (ML-1M) (Movie recommendation)
Amazon Games (Video game recommendation)
Amazon Bundle (E-commerce recommendation: Electronic, Clothing, Food)

Metrics:

Hit Rate @ 1, 5 (HR@K)
NDCG @ 1, 5
Statistical methodology: p-value significance testing reported in Table 2
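For next-item prediction with a single ground-truth item per session, both metrics reduce to simple per-session formulas, sketched below. The function names are illustrative; dataset-level scores are the mean over sessions.

```python
import math
from typing import List


def hit_rate_at_k(ranked: List[str], target: str, k: int) -> float:
    """1.0 if the true next item appears in the top k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: List[str], target: str, k: int) -> float:
    """With one relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1)."""
    if target not in ranked[:k]:
        return 0.0
    rank = ranked.index(target) + 1  # 1-based position of the true item
    return 1.0 / math.log2(rank + 1)
```

Under this single-target setup, HR@1 and NDCG@1 coincide, since a hit at rank 1 scores exactly 1 in both.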

Key Results

PO4ISR consistently outperforms all baselines across three domains, with particularly large gains on sparse datasets (Games, Bundle).

Benchmark	Metric	Baseline	This Paper	Δ
ML-1M	NDCG@5	0.3501	0.3810	+0.0309
Amazon Games	NDCG@5	0.2310	0.4313	+0.2003
Amazon Bundle	NDCG@5	0.1939	0.3040	+0.1101

Ablation studies confirm the value of optimization and cross-domain selection.

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Games	NDCG@5	0.2110	0.4381	+0.2271
ML-1M	NDCG@5	0.3662	0.3810	+0.0148

Experiment Figures

Cross-domain performance of the Top-1 prompt from each domain applied to the others.

Case study of a specific session showing the LLM's reasoning trace.

Main Takeaways

PO4ISR excels on sparser datasets (Games) with shorter sessions, suggesting that LLMs can reason with limited data where traditional collaborative filtering struggles
Cross-domain transfer is highly effective: The prompt optimized on Games worked best for ALL domains, suggesting it captured universal reasoning patterns
Simplified initial prompts and task descriptions lead to better optimization trajectories than complex ones
Lower-quality initial prompts see larger relative gains from optimization, but better initialization still yields higher absolute performance

📚 Prerequisite Knowledge

Prerequisites

Session-based Recommendation (SR)
Large Language Models (LLMs) and Prompt Engineering
Multi-Armed Bandits (UCB)

Key Terms

SR: Session-based Recommendation—predicting the next user action based on a short sequence of anonymous interactions

ISR: Intent-aware Session Recommendation—SR approaches that explicitly try to model the user's underlying purpose or intent

UCB Bandits: Upper Confidence Bound—an algorithm for decision making that balances exploring new options (prompts with uncertain performance) and exploiting known good ones

Zero-shot prompting: Asking an LLM to perform a task without providing any training examples in the prompt

Hallucination: When an LLM generates content that is factually incorrect or not grounded in the input (e.g., recommending items not in the candidate list)

NDCG: Normalized Discounted Cumulative Gain—a ranking metric that gives higher scores to correct items appearing at the top of the list