LLMSRec-Syn improves sequential recommendation by synthesizing multiple user histories into a single, compact 'aggregated demonstration' prompt, overcoming context length limits and information sparsity.
Core Problem
Standard few-shot in-context learning for sequential recommendation fails to scale: adding more individual user demonstrations degrades performance due to context length limits and information overload.
Why it matters:
LLMs struggle to process long, repetitive prompts containing multiple distinct user histories, often losing focus on relevant details (known as the 'lost in the middle' phenomenon)
Single-user demonstrations are often too sparse to capture the complex patterns needed for accurate recommendation
Existing ICL methods for recommendation perform poorly compared to traditional supervised learning models (like SASRec)
Concrete Example: If a test user likes Sci-Fi, a standard few-shot prompt might stack 3 full histories of other Sci-Fi users. The prompt becomes too long for the LLM, which either loses track of the relevant details or receives a truncated input. LLMSRec-Syn instead creates one synthetic 'super-user' history that combines the key Sci-Fi interactions from all 3 users.
Key Novelty
Aggregated Demonstrations (LLMSRec-Syn)
Instead of stacking multiple distinct user demonstrations (User A history + User B history), the method merges items from multiple relevant users into a single, synthetic user history sorted chronologically.
This approach reduces token usage by removing repeated instruction boilerplate and presents the LLM with a denser, more informative signal about item transitions.
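The merge described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `(timestamp, title)` representation, the function name `aggregate_histories`, and the choice to drop titles that repeat across users are all assumptions made for the example.

```python
import heapq
from typing import List, Tuple

# (timestamp, item title); the timestamp field is an assumption for this sketch.
Interaction = Tuple[int, str]


def aggregate_histories(histories: List[List[Interaction]]) -> List[str]:
    """Merge several per-user histories into one chronologically ordered item list."""
    # Each per-user history is assumed to be sorted by timestamp already,
    # which is what heapq.merge requires of its inputs.
    merged = heapq.merge(*histories, key=lambda pair: pair[0])
    seen = set()
    items = []
    for _, title in merged:
        if title in seen:
            # Dropping cross-user repeats is an illustrative choice, not a documented step.
            continue
        seen.add(title)
        items.append(title)
    return items


user_a = [(1, "Alien"), (4, "Blade Runner")]
user_b = [(2, "The Matrix"), (3, "Alien"), (5, "Dune")]
synthetic_history = aggregate_histories([user_a, user_b])
```

Here `synthetic_history` interleaves both users' items by time, so the single demonstration carries transitions observed across users.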
Architecture
Comparison of Zero-shot, Few-shot, and the proposed Aggregated One-shot frameworks.
Evaluation Highlights
LLMSRec-Syn outperforms standard 1-shot ICL by a relative +16.7% in NDCG@10 (0.4286 to 0.5002) on the MovieLens-1M dataset.
Surpasses state-of-the-art zero-shot methods (like Hou et al. 2023) by significant margins across three datasets (ML-1M, Games, LastFM).
Achieves parity with or exceeds supervised baselines (like SASRec) in specific low-data or sparse settings (e.g., on LastFM).
Breakthrough Assessment
7/10
Offers a clever, simple prompting strategy that effectively solves the context-window bottleneck for few-shot recommendation, turning a failure case (more shots = worse performance) into a success.
⚙️ Technical Details
Problem Definition
Setting: Sequential Recommendation as a conditional ranking task
Inputs: A sequence of past interacted items x_i, a set of candidate items c_i, and a ground truth next item y_i
Outputs: A ranking of the items in c_i such that y_i is ranked as high as possible
Pipeline Flow
User Selection (Retrieval)
Demonstration Aggregation
Prompt Construction
LLM Inference
System Modules
Demonstration Retriever (Retrieval & Selection)
Identify training users semantically similar to the test user to serve as demonstrations
Model or implementation: OpenAI text-embedding-ada-002
Aggregator (Retrieval & Selection)
Merge histories of retrieved users into one sequence
Model or implementation: Rule-based (Chronological Merge)
Prompt Generator
Format the aggregated history into a natural language instruction
Model or implementation: Template-based
Ranker
Generate the ranked list of candidate items
Model or implementation: ChatGPT (GPT-3.5-Turbo)
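As a rough sketch of the Demonstration Retriever step, the snippet below picks the training users whose embedding vectors are closest to the test user's. It assumes the embeddings have already been computed (for example with the embedding model named above) and that similarity is measured by cosine; the summary does not state the similarity function, and `retrieve_similar_users` and `k` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_similar_users(
    test_vec: Sequence[float],
    train_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[str]:
    """Return the ids of the k training users most similar to the test user."""
    scored = sorted(
        train_vecs.items(),
        key=lambda pair: cosine(test_vec, pair[1]),
        reverse=True,
    )
    return [user_id for user_id, _ in scored[:k]]


# Toy 2-dimensional vectors stand in for real embeddings.
train = {"u1": [0.9, 0.1], "u2": [0.0, 1.0], "u3": [1.0, 1.0]}
neighbours = retrieve_similar_users([1.0, 0.0], train, k=2)
```

The retrieved ids would then be passed to the Aggregator, which merges those users' histories.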
Novel Architectural Elements
Aggregated Demonstration Logic: The specific pipeline step of interweaving multiple user timelines into a single synthetic timeline to compress context while retaining transition patterns
Modeling
Base Model: ChatGPT (GPT-3.5-Turbo)
Compute: Not reported in the paper
Comparison to Prior Work
vs. Hou et al. (2023): LLMSRec-Syn uses aggregated cross-user demonstrations rather than just single-user or self-history demonstrations
vs. Standard Few-Shot: LLMSRec-Syn merges examples into one synthetic user, avoiding the performance degradation observed when simply stacking prompts
vs. SASRec: LLMSRec-Syn is an inference-only method requiring no training, whereas SASRec requires full supervised training
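The structural difference from standard few-shot prompting can be illustrated with a toy prompt builder. The instruction text and field labels below are hypothetical placeholders, not the paper's template; the point is only that stacking repeats the instruction once per demonstration, while the aggregated variant states it once.

```python
from typing import List, Tuple

# (history, candidates, ranked answer); a hypothetical demonstration layout.
Demo = Tuple[List[str], List[str], List[str]]

# Placeholder instruction; the actual wording used in the paper is not reproduced here.
INSTRUCTION = "Rank the candidate items by how likely the user is to interact with them next."


def format_demo(history: List[str], candidates: List[str], ranking: List[str]) -> str:
    """Render one demonstration as instruction + history + candidates + ranking."""
    return "\n".join([
        INSTRUCTION,
        "History: " + ", ".join(history),
        "Candidates: " + ", ".join(candidates),
        "Ranking: " + ", ".join(ranking),
    ])


def stacked_few_shot(demos: List[Demo]) -> str:
    """Standard few-shot: one full demonstration per retrieved user."""
    return "\n\n".join(format_demo(*demo) for demo in demos)


def aggregated_one_shot(merged: Demo) -> str:
    """Aggregated variant: a single demonstration built from the merged history."""
    return format_demo(*merged)


demos = [
    (["A", "B"], ["C", "D"], ["C", "D"]),
    (["E"], ["F", "G"], ["G", "F"]),
    (["H"], ["I", "J"], ["I", "J"]),
]
merged = (["A", "E", "B", "H"], ["C", "G", "I"], ["C", "G", "I"])
stacked = stacked_few_shot(demos)
aggregated = aggregated_one_shot(merged)
```

With three retrieved users, `stacked` contains the instruction three times and `aggregated` contains it once.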
Limitations
Aggregated demonstrations introduce noise if the retrieved users have conflicting preferences
Performance still heavily dependent on the quality of the retriever (finding relevant users)
The chronological merging strategy is heuristic and might disrupt specific sequential signals if users have very different timelines
Limited by the context window of the underlying LLM (though better than standard few-shot)
Task: Sequential recommendation (next-item prediction) using a leave-one-out strategy
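A leave-one-out split is commonly implemented by holding out each user's final interaction as the prediction target. The sketch below assumes that convention; the summary does not spell out the exact split, and `leave_one_out` is a hypothetical helper name.

```python
from typing import List, Tuple


def leave_one_out(sequence: List[str]) -> Tuple[List[str], str]:
    """Split one user's interaction sequence into (history, held-out target)."""
    if len(sequence) < 2:
        # A history and a target both need at least one item.
        raise ValueError("need at least two interactions to split")
    return sequence[:-1], sequence[-1]


history, target = leave_one_out(["Alien", "The Matrix", "Dune"])
```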
Benchmarks:
MovieLens-1M (ML-1M) (Movie Recommendation)
Amazon Games (Product Recommendation)
LastFM-2K (Music Artist Recommendation)
Metrics:
NDCG@10
NDCG@20
Statistical methodology: Experiments repeated 9 times; average results reported. Standard deviation indicated in plots but not explicitly tabulated for main comparison.
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
MovieLens-1M | NDCG@10 | 0.4286 | 0.5002 | +0.0716
MovieLens-1M | NDCG@10 | 0.4700 | 0.5002 | +0.0302
LastFM | NDCG@10 | 0.6865 | 0.7145 | +0.0280
MovieLens-1M | NDCG@10 | 0.41 | 0.50 | +0.09
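The Δ values in the key results are absolute differences in NDCG@10. Converting the first row to a relative gain reproduces the +16.7% figure quoted in the evaluation highlights; the helper name `relative_gain_pct` is just for this example.

```python
def relative_gain_pct(baseline: float, ours: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# First row of the key results: MovieLens-1M, NDCG@10.
absolute_delta = round(0.5002 - 0.4286, 4)
relative_delta = round(relative_gain_pct(0.4286, 0.5002), 1)
print(absolute_delta, relative_delta)  # prints: 0.0716 16.7
```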
Experiment Figures
Impact of number of demonstrations on standard ICL performance.
Performance of LLMSRec-Syn (Aggregated) as the number of aggregated users increases.
Main Takeaways
Standard In-Context Learning (ICL) scales poorly for recommendation: increasing demonstrations from 1 to 4 causes performance to drop due to context limits.
Task consistency is critical: demonstrations must use the exact same ranking task (T3) as the test instruction; using next-item prediction (T1) or pairwise contrast (T2) in demonstrations hurts performance.
The 'Aggregated Demonstration' strategy successfully compresses information, allowing the model to utilize multiple relevant user histories without overwhelming the context window.
LLMSRec-Syn achieves state-of-the-art results among LLM-based methods and serves as a strong zero-training alternative to supervised models.
📚 Prerequisite Knowledge
Prerequisites
Understanding of In-Context Learning (ICL)
Basics of Sequential Recommendation (predicting next item based on history)
Familiarity with ranking metrics (NDCG)
Key Terms
ICL: In-Context Learning—adapting an LLM to a task by providing examples (demonstrations) in the prompt without updating weights
Aggregated Demonstration: The paper's novel technique of merging multiple users' interaction histories into a single synthetic history to serve as a dense prompt example
NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that gives higher scores to correct items appearing earlier in the list
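For next-item prediction there is a single relevant item per test case, so the ideal DCG is 1 and NDCG@k reduces to 1 / log2(rank + 1) when the target appears in the top k, and 0 otherwise. A minimal sketch of that special case (the function name and argument layout are chosen for illustration):

```python
import math
from typing import List


def ndcg_at_k(ranked: List[str], target: str, k: int = 10) -> float:
    """NDCG@k for a ranking with exactly one relevant item."""
    top_k = ranked[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    return 1.0 / math.log2(rank + 1)


ranking = ["Dune", "Alien", "The Matrix"]
top_hit = ndcg_at_k(ranking, "Dune")          # target ranked first
third_hit = ndcg_at_k(ranking, "The Matrix")  # target ranked third
miss = ndcg_at_k(ranking, "Solaris")          # target not in the list
```

A target at rank 1 scores 1.0, a target at rank 3 scores 0.5, and a target outside the top k scores 0.0.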