The Whole is Better than the Sum: Using Aggregated Demonstrations in In-Context Learning for Sequential Recommendation

📝 Paper Summary

Sequential Recommendation
In-Context Learning (ICL)

LLMSRec-Syn improves sequential recommendation by synthesizing multiple user histories into a single, compact 'aggregated demonstration' prompt, overcoming context length limits and information sparsity.

Core Problem

Standard few-shot in-context learning for sequential recommendation fails to scale: adding more individual user demonstrations degrades performance due to context length limits and information overload.

Why it matters:

LLMs struggle to process long, repetitive prompts containing multiple distinct user histories, often losing focus on relevant details (known as the 'lost in the middle' phenomenon)
Single-user demonstrations are often too sparse to capture the complex patterns needed for accurate recommendation
Existing ICL methods for recommendation perform poorly compared to traditional supervised learning models (like SASRec)

Concrete Example: If a test user likes Sci-Fi, a standard few-shot prompt might stack 3 full histories of other Sci-Fi users. The resulting prompt can be longer than the LLM handles well: the model may lose track of the relevant details, or the input may be truncated. LLMSRec-Syn instead creates one synthetic 'super-user' history combining the key Sci-Fi interactions from all 3 users.

Key Novelty

Aggregated Demonstrations (LLMSRec-Syn)

Instead of stacking multiple distinct user demonstrations (User A history + User B history), the method merges items from multiple relevant users into a single, synthetic user history sorted chronologically.
This approach reduces token usage by removing repeated instruction boilerplate and presents the LLM with a denser, more informative signal about item transitions.

Architecture

Comparison of Zero-shot, Few-shot, and the proposed Aggregated One-shot frameworks.

Evaluation Highlights

LLMSRec-Syn outperforms standard 1-shot ICL by +16.7% (NDCG@10) on the MovieLens-1M dataset.
Surpasses state-of-the-art zero-shot methods (like Hou et al. 2023) by significant margins across three datasets (ML-1M, Games, LastFM).
Matches or exceeds supervised baselines (like SASRec) in specific low-data or sparse settings (e.g., on LastFM).

Breakthrough Assessment

7/10

Offers a clever, simple prompting strategy that mitigates the context-window bottleneck for few-shot recommendation, turning a failure case (more shots = worse performance) into a success.

⚙️ Technical Details

Problem Definition

Setting: Sequential Recommendation as a conditional ranking task

Inputs: A sequence of past interacted items x_i, a set of candidate items c_i, and a ground truth next item y_i

Outputs: A ranking of the items in c_i such that y_i is ranked as high as possible

Pipeline Flow

User Selection (Retrieval)
Demonstration Aggregation
Prompt Construction
LLM Inference

System Modules

Demonstration Retriever (Retrieval & Selection)

Identify training users semantically similar to the test user to serve as demonstrations

Model or implementation: OpenAI text-embedding-ada-002
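
The retrieval step can be pictured as a nearest-neighbour search over user embeddings. The sketch below is a minimal illustration, not the paper's implementation: it assumes the embeddings have already been computed (for example by the embedding model named above) and are available as plain lists of floats. The function names, the use of cosine similarity, and the value of k are all hypothetical choices made for this example.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar_users(
    test_embedding: Sequence[float],
    train_embeddings: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[str]:
    """Return the ids of the k training users most similar to the test user."""
    ranked = sorted(
        train_embeddings,
        key=lambda uid: cosine_similarity(test_embedding, train_embeddings[uid]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional vectors stand in for real embeddings (assumption).
train = {
    "user_a": [0.9, 0.1, 0.0],
    "user_b": [0.0, 1.0, 0.0],
    "user_c": [0.8, 0.2, 0.1],
}
top_users = retrieve_similar_users([1.0, 0.0, 0.0], train, k=2)
print(top_users)  # ['user_a', 'user_c']
```

For large training sets an approximate nearest-neighbour index would likely replace the full sort, but the selection logic stays the same.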

Aggregator (Retrieval & Selection)

Merge histories of retrieved users into one sequence

Model or implementation: Rule-based (Chronological Merge)
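
A chronological merge of this kind can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's exact rule: each interaction is assumed to be a hypothetical `(timestamp, item)` pair, repeated items are dropped after their first occurrence, and the merged sequence is capped at `max_items` entries. The deduplication and the cap are choices made for this example.

```python
import heapq
from typing import Dict, List, Tuple

Interaction = Tuple[int, str]  # (timestamp, item title); hypothetical record shape


def aggregate_histories(
    histories: Dict[str, List[Interaction]],
    max_items: int = 50,
) -> List[str]:
    """Merge per-user histories into one chronologically ordered item list."""
    # heapq.merge expects every input to be sorted already, so sort defensively.
    sorted_histories = [sorted(h) for h in histories.values()]
    merged = heapq.merge(*sorted_histories)

    seen = set()
    aggregated: List[str] = []
    for _timestamp, item in merged:
        if item in seen:  # assumption: keep only the first occurrence of an item
            continue
        seen.add(item)
        aggregated.append(item)
    # Assumption: keep the most recent max_items entries to bound prompt length.
    return aggregated[-max_items:]


histories = {
    "user_a": [(1, "Alien"), (4, "Blade Runner")],
    "user_b": [(2, "The Matrix"), (3, "Alien")],
    "user_c": [(5, "Dune")],
}
synthetic_history = aggregate_histories(histories)
print(synthetic_history)  # ['Alien', 'The Matrix', 'Blade Runner', 'Dune']
```

Because the merge interleaves users by timestamp, transitions in the synthetic history can cross user boundaries, which is the trade-off noted under Limitations.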

Prompt Generator

Format the aggregated history into a natural language instruction

Model or implementation: Template-based
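
A template-based generator might look like the following sketch. The template wording, the function names, and the way the demonstration's answer is supplied by the caller are all assumptions made for illustration; the paper's actual prompt text is not reproduced here. The same task instruction is used for the demonstration and for the test query, in line with the task-consistency finding reported in the takeaways.

```python
from typing import List

# Hypothetical template: the wording is illustrative, not the paper's actual prompt.
TASK_TEMPLATE = (
    "The user has interacted with these items, in order:\n"
    "{history}\n"
    "Candidate items:\n"
    "{candidates}\n"
    "Rank the candidates by how likely the user is to interact with them next.\n"
    "Ranking:\n"
)


def numbered(items: List[str]) -> str:
    """Render items as a 1-based numbered list, one per line."""
    return "\n".join(f"{i}. {title}" for i, title in enumerate(items, start=1))


def build_prompt(
    demo_history: List[str],
    demo_candidates: List[str],
    demo_ranking: List[str],
    test_history: List[str],
    test_candidates: List[str],
) -> str:
    """One aggregated demonstration (with its answer) followed by the open test query."""
    demonstration = TASK_TEMPLATE.format(
        history=numbered(demo_history),
        candidates=numbered(demo_candidates),
    ) + numbered(demo_ranking)
    query = TASK_TEMPLATE.format(
        history=numbered(test_history),
        candidates=numbered(test_candidates),
    )
    return demonstration + "\n\n" + query


prompt = build_prompt(
    demo_history=["Alien", "The Matrix", "Blade Runner"],
    demo_candidates=["Dune", "Titanic"],
    demo_ranking=["Dune", "Titanic"],
    test_history=["Star Wars", "Arrival"],
    test_candidates=["Interstellar", "Notting Hill"],
)
print(prompt)
```

The instruction text appears exactly twice, once per task instance, no matter how many users contributed to the aggregated history.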

Ranker

Generate the ranked list of candidate items

Model or implementation: ChatGPT (GPT-3.5-Turbo)
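
The model returns free text, so its answer has to be mapped back onto the candidate set before it can be scored. The sketch below shows one possible post-processing step; it is an assumption, not something the summary describes. It expects a numbered list, matches titles case-insensitively, ignores titles that are not candidates, and appends any omitted candidates in their original order. The function name `parse_ranking` is hypothetical, and no LLM API call is made here.

```python
import re
from typing import List


def parse_ranking(response: str, candidates: List[str]) -> List[str]:
    """Turn a numbered-list response into a full ranking over the candidates."""
    lookup = {c.lower(): c for c in candidates}
    ranking: List[str] = []
    for line in response.splitlines():
        # Strip a leading "1." or "2)" style marker and surrounding whitespace.
        title = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        match = lookup.get(title.lower())
        if match is not None and match not in ranking:
            ranking.append(match)
    # Assumption: candidates the model omitted are appended in their original order.
    ranking.extend(c for c in candidates if c not in ranking)
    return ranking


response = "1. Dune\n2) alien\n3. Some Hallucinated Film\n"
ranking = parse_ranking(response, ["Alien", "Dune", "Blade Runner"])
print(ranking)  # ['Dune', 'Alien', 'Blade Runner']
```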

Novel Architectural Elements

Aggregated Demonstration Logic: The specific pipeline step of interweaving multiple user timelines into a single synthetic timeline to compress context while retaining transition patterns

Modeling

Base Model: ChatGPT (GPT-3.5-Turbo)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Hou et al. (2023): LLMSRec-Syn uses aggregated cross-user demonstrations rather than just single-user or self-history demonstrations
vs. Standard Few-Shot: LLMSRec-Syn merges examples into one synthetic user, avoiding the performance degradation observed when simply stacking prompts
vs. SASRec: LLMSRec-Syn is an inference-only method requiring no training, whereas SASRec requires full supervised training

Limitations

Aggregated demonstrations introduce noise if the retrieved users have conflicting preferences
Performance is still heavily dependent on the quality of the retriever (finding relevant users)
The chronological merging strategy is heuristic and might disrupt specific sequential signals if users have very different timelines
Limited by the context window of the underlying LLM (though better than standard few-shot)

Reproducibility

Code: https://github.com/demoleiwang/LLMSRec_Syn

📊 Experiments & Results

Evaluation Setup

Sequential recommendation (next-item prediction) using a leave-one-out strategy
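
As a rough illustration of leave-one-out evaluation, each user's final interaction is held out as the prediction target and the preceding interactions form the input sequence. The sketch assumes histories are already sorted oldest to newest; the function name is hypothetical, and any further hold-out for validation is not shown because the summary does not describe it.

```python
from typing import List, Tuple


def leave_one_out_split(history: List[str]) -> Tuple[List[str], str]:
    """Hold out the last interaction as the target; return (input_sequence, target)."""
    if len(history) < 2:
        raise ValueError("need at least two interactions to form an input and a target")
    return history[:-1], history[-1]


input_sequence, target = leave_one_out_split(["Alien", "The Matrix", "Dune"])
print(input_sequence)  # ['Alien', 'The Matrix']
print(target)          # Dune
```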

Benchmarks:

MovieLens-1M (ML-1M) (Movie Recommendation)
Amazon Games (Product Recommendation)
LastFM-2K (Music Artist Recommendation)

Metrics:

NDCG@10
NDCG@20

Statistical methodology: Each experiment was repeated 9 times and the average is reported. Standard deviations are shown in the plots but are not tabulated for the main comparison.
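
The NDCG@k metrics listed above simplify considerably in this setting, because each test instance has exactly one relevant item (the ground-truth next item). The ideal DCG is then 1, so NDCG@k reduces to 1 / log2(rank + 1) when the target appears at position rank within the top k, and 0 otherwise. The sketch below implements that special case; the function name is illustrative.

```python
import math
from typing import List


def ndcg_at_k(ranking: List[str], target: str, k: int) -> float:
    """NDCG@k for a ranking with a single relevant item."""
    top_k = ranking[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    return 1.0 / math.log2(rank + 1)


ranking = ["Dune", "Alien", "Blade Runner"]
print(ndcg_at_k(ranking, "Dune", k=10))          # 1.0
print(ndcg_at_k(ranking, "Blade Runner", k=10))  # 0.5
print(ndcg_at_k(ranking, "Blade Runner", k=2))   # 0.0
```

Reported scores are averages of this per-instance value over all test users.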

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens-1M	NDCG@10	0.4286	0.5002	+0.0716
MovieLens-1M	NDCG@10	0.4700	0.5002	+0.0302
LastFM	NDCG@10	0.6865	0.7145	+0.0280
MovieLens-1M	NDCG@10	0.41	0.50	+0.09
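
The Δ column reports absolute differences. Relative gains can be derived from the same numbers, as in the short sketch below, which uses only the three rows given to four decimal places (the last row is rounded to two decimals and is left out). The first row's relative gain matches the +16.7% figure quoted in the Evaluation Highlights.

```python
from typing import List, Tuple

# (benchmark, baseline NDCG@10, this paper's NDCG@10), copied from the table above.
rows: List[Tuple[str, float, float]] = [
    ("MovieLens-1M", 0.4286, 0.5002),
    ("MovieLens-1M", 0.4700, 0.5002),
    ("LastFM", 0.6865, 0.7145),
]


def relative_gain(baseline: float, ours: float) -> float:
    """Improvement over the baseline as a fraction of the baseline score."""
    return (ours - baseline) / baseline


gains = [f"{relative_gain(base, ours):.1%}" for _, base, ours in rows]
for (name, base, ours), gain in zip(rows, gains):
    print(f"{name}: {base:.4f} -> {ours:.4f} ({gain})")
# MovieLens-1M: 0.4286 -> 0.5002 (16.7%)
# MovieLens-1M: 0.4700 -> 0.5002 (6.4%)
# LastFM: 0.6865 -> 0.7145 (4.1%)
```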

Experiment Figures

Impact of number of demonstrations on standard ICL performance.

Performance of LLMSRec-Syn (Aggregated) as the number of aggregated users increases.

Main Takeaways

Standard In-Context Learning (ICL) scales poorly for recommendation: increasing demonstrations from 1 to 4 causes performance to drop due to context limits.
Task consistency is critical: demonstrations must use the exact same ranking task (T3) as the test instruction; using next-item prediction (T1) or pairwise contrast (T2) in demonstrations hurts performance.
The 'Aggregated Demonstration' strategy successfully compresses information, allowing the model to utilize multiple relevant user histories without overwhelming the context window.
LLMSRec-Syn achieves state-of-the-art results among LLM-based methods and serves as a strong zero-training alternative to supervised models.

📚 Prerequisite Knowledge

Prerequisites

Understanding of In-Context Learning (ICL)
Basics of Sequential Recommendation (predicting next item based on history)
Familiarity with ranking metrics (NDCG)

Key Terms

ICL: In-Context Learning—adapting an LLM to a task by providing examples (demonstrations) in the prompt without updating weights

Aggregated Demonstration: The paper's novel technique of merging multiple users' interaction histories into a single synthetic history to serve as a dense prompt example

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that gives higher scores to correct items appearing earlier in the list

SASRec: Self-Attentive Sequential Recommendation—a strong supervised learning baseline using attention mechanisms

SBERT: Sentence-BERT—a modification of the BERT network that uses siamese networks to derive semantically meaningful sentence embeddings

Zero-shot: Asking the model to perform the task with instructions but no example demonstrations

Few-shot: Asking the model to perform the task with multiple example demonstrations included in the prompt