Automating Personalization: Prompt Optimization for Recommendation Reranking

📝 Paper Summary

Recommendation Reranking, Prompt Optimization

AGP improves recommendation reranking by automatically refining the user profile generation prompt using batched, position-aware feedback that identifies specific ranking errors.

Core Problem

LLM-based reranking relies on manually crafted prompts that fail to scale or capture nuanced preferences from noisy item metadata, while existing optimization methods use aggregated metrics that lack actionable guidance.

Why it matters:

Manual prompt engineering is labor-intensive, static, and prone to trial-and-error, limiting scalability across diverse user behaviors.
Standard optimization metrics like NDCG provide only a general score, failing to tell the LLM *why* a specific ranking was poor or how to fix it.
Unstructured metadata (e.g., noisy titles) makes it difficult for standard prompts to infer accurate user profiles.

Concrete Example: If a user likes 'Sci-Fi' but the LLM ranks a relevant movie 5th instead of 1st, standard methods simply report a lower NDCG score. AGP generates specific feedback stating 'Item ranked 5th should be 1st', prompting the system to refine the profile generator to better emphasize 'Sci-Fi' preferences.

Key Novelty

Auto-Guided Prompt Refinement (AGP)

Optimizes the *user profile generation* prompt rather than the final reranking prompt, allowing the LLM to better summarize preferences before ranking.
Uses *Position-Based Feedback* to generate explicit textual instructions based on the gap between an item's predicted rank and its ideal rank.
Employs *Batched Training* to aggregate feedback across multiple users, preventing the prompt from overfitting to individual user quirks.
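
To make the position-based feedback idea concrete, here is a minimal sketch of how a rank gap could be turned into a textual instruction. The function name `position_feedback` and the exact wording of the message are assumptions for illustration; the paper's actual feedback template is not given in this summary.

```python
def position_feedback(item: str, predicted_rank: int, target_rank: int) -> str:
    """Turn a rank gap into a textual correction (wording is illustrative)."""
    gap = predicted_rank - target_rank  # positive: item was placed too low
    if gap == 0:
        return f"'{item}' is correctly placed at rank {predicted_rank}."
    direction = "higher" if gap > 0 else "lower"
    return (
        f"'{item}' was ranked {predicted_rank} but should be ranked "
        f"{target_rank} (move it {abs(gap)} position(s) {direction})."
    )


# The Sci-Fi example from the summary: relevant item ranked 5th instead of 1st.
print(position_feedback("Dune", 5, 1))
```

Unlike a single NDCG score, a message of this kind names the item and the direction of the error, which is what lets the prompt optimizer act on it.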

Evaluation Highlights

Achieves improvements of 5.61–20.68% in NDCG@10 over baseline models (LightGCN, SASRec) across Amazon, Yelp, and Goodreads datasets.
Demonstrates high data efficiency by reaching optimal performance with only 100 training users.
Enhances graph-based recommenders (LightGCN) significantly, showing 9.36–20.68% gains by injecting semantic personalization into collaborative filtering results.

Breakthrough Assessment

7/10

Effective application of LLM self-optimization to recommendation. The shift from optimizing reranking directly to optimizing profile generation with position-based feedback is a clever, interpretable design choice yielding strong results.

⚙️ Technical Details

Problem Definition

Setting: Reranking a candidate list provided by a base recommender to improve personalization.

Inputs: User interaction history H(u) (sequence of item titles) and a baseline ranking list R_base(u).

Outputs: An optimized reranked list R_LLM(u).

Pipeline Flow

Profile Generation: History + Prompt -> User Profile
Reranking: Profile + Candidate List -> Reranked List
Feedback (Training only): Reranked List vs Ground Truth -> Position-based Signals -> Prompt Update
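
The two inference stages above can be sketched as follows. This is a rough illustration, not the authors' implementation: the `LLM` callable interface, the prompt wording, and the `toy_llm` stand-in are all hypothetical, and a real system would call a model such as GPT-4o in place of the stub.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt text in, reply text out


def generate_profile(llm: LLM, profile_prompt: str, history: List[str]) -> str:
    """Stage 1: history + learnable prompt -> user profile."""
    return llm(f"{profile_prompt}\n\nHistory:\n" + "\n".join(history))


def rerank(llm: LLM, profile: str, candidates: List[str]) -> List[str]:
    """Stage 2: profile + candidate list -> reranked list."""
    prompt = (
        f"User profile:\n{profile}\n\nCandidates:\n"
        + "\n".join(candidates)
        + "\n\nReturn the candidates one per line, best first."
    )
    reply = [line.strip() for line in llm(prompt).splitlines()]
    ranked: List[str] = []
    for title in reply:
        if title in candidates and title not in ranked:
            ranked.append(title)
    # Keep any candidate the model dropped, in its base order.
    return ranked + [c for c in candidates if c not in ranked]


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for illustration only."""
    if "History:" in prompt:
        return "Likes Sci-Fi"
    # Reranking call: this stub simply hardcodes a Sci-Fi preference.
    titles = prompt.split("Candidates:\n")[1].split("\n\n")[0].splitlines()
    return "\n".join(sorted(titles, key=lambda t: "Sci-Fi" not in t))


profile = generate_profile(toy_llm, "Summarize this user's tastes.", ["Sci-Fi X", "Sci-Fi Y"])
reranked = rerank(toy_llm, profile, ["Romance A", "Sci-Fi B", "Drama C"])
print(reranked)
```

Only `profile_prompt` is meant to change during training; the reranking prompt stays fixed, which matches the summary's point that optimization targets the first stage.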

System Modules

User Profile Generator

Synthesize a structured user profile from raw interaction history using a learnable prompt.

Model or implementation: LLM (e.g., GPT-4o)

Reranker

Reorder the candidate list based on the generated user profile.

Model or implementation: LLM (e.g., GPT-4o)

Novel Architectural Elements

Two-stage pipeline where the optimization targets the *first* stage (Profile Generation) rather than the final task (Reranking).
Integration of explicit position-based error signals into the prompt update loop.

Modeling

Base Model: Evaluated with GPT-4o, GPT-4o-Mini, GPT-o3-Mini, and DeepSeek-V3

Key Hyperparameters:

training_users: 100
max_epochs: 10
batch_size: 5, 10, or 20
sequence_length: 5, 10, or 20 items

Compute: Total API calls = Epochs * (3 * |U| + 2 * (|U| / BatchSize)), where |U| is the number of training users. Training uses only 100 users.
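
As a worked instance of the API-call formula, the sketch below plugs in the hyperparameters listed above. The function name is hypothetical, and rounding the number of batches up with `math.ceil` is an assumption for cases where the batch size does not divide the user count evenly. Since 10 is the maximum number of epochs, these figures are upper bounds.

```python
import math


def total_api_calls(epochs: int, num_users: int, batch_size: int) -> int:
    """Epochs * (3 * |U| + 2 * (|U| / BatchSize)), with batches rounded up."""
    num_batches = math.ceil(num_users / batch_size)  # assumption: partial batch counts as one
    return epochs * (3 * num_users + 2 * num_batches)


for batch_size in (5, 10, 20):
    print(batch_size, total_api_calls(epochs=10, num_users=100, batch_size=batch_size))
```

With 100 users and 10 epochs this gives 3,400, 3,200, and 3,100 calls for batch sizes 5, 10, and 20, so the per-user term dominates and batch size has little effect on cost.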

Comparison to Prior Work

vs. RecPrompt: AGP handles unstructured/noisy metadata instead of structured news topics; optimizes the profile generation prompt rather than the recommendation prompt itself.
vs. LLM-Dir/CoT: AGP is iterative and learns a global prompt from batched feedback, whereas Dir/CoT are static inference methods.
vs. Input Reranking (Dehghankar): AGP provides structured preference modeling feedback rather than just reordering based on exposure constraints.

Limitations

Effectiveness depends on the quality of textual item metadata (e.g., Yelp's noisy descriptions limit gains).
Reliance on commercial LLM APIs (GPT-4o) for optimization may be costly.
Graph-based baselines (LightGCN) benefit more than sequential baselines (SASRec), suggesting that AGP's gains partly overlap with what sequence-aware models already capture.

Reproducibility

No code URL provided in the text. Datasets (Amazon, Yelp, Goodreads) are public. Prompts and specific prompt evolution examples are not fully detailed in the snippet.

📊 Experiments & Results

Evaluation Setup

Reranking top-10 predictions from baseline recommenders (LightGCN, SASRec) using a Leave-One-Out strategy.

Benchmarks:

Amazon Movies & TV (Movie Recommendation)
Yelp (Business Recommendation)
Goodreads (Book Recommendation)

Metrics:

NDCG@10
Statistical methodology: Not explicitly reported in the paper
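
For reference, here is a minimal sketch of NDCG@10 under a leave-one-out setup. It assumes binary relevance with exactly one held-out item per user, so the ideal DCG is 1 and the score reduces to 1 / log2(rank + 1); the paper's exact evaluation code is not given in this summary.

```python
import math


def ndcg_at_k(rank_of_held_out: int, k: int = 10) -> float:
    """NDCG@k for a single relevant item at a 1-indexed rank."""
    if rank_of_held_out < 1:
        raise ValueError("rank is 1-indexed")
    if rank_of_held_out > k:
        return 0.0  # held-out item fell outside the top-k list
    return 1.0 / math.log2(rank_of_held_out + 1)


# Moving the held-out item from rank 5 to rank 1 raises the score to 1.0.
print(round(ndcg_at_k(5), 4), ndcg_at_k(1))
```

The score reflects only how high the held-out item sits; it does not say which item was misplaced, which is the gap position-based feedback is meant to fill.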

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Movies & TV	NDCG@10	0.696	0.705	+0.009

Experiment Figures

Impact of Position-Based Feedback (PBF) on NDCG@10 and Average Rank.

Impact of Summarization in Batched Training.

Main Takeaways

AGP consistently outperforms static prompting baselines (LLM-Dir, LLM-CoT) by self-optimizing the profile generation prompt.
Position-based feedback is more effective than aggregate metrics (like NDCG) for optimization because it provides interpretable, item-level correction signals.
The method is highly data-efficient, achieving competitive results with just 100 training users.
Performance gains are higher on graph-based baselines (LightGCN) than sequential ones (SASRec), likely because AGP adds sequential/semantic reasoning that LightGCN lacks.

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems (Candidate Generation vs. Reranking)
Large Language Models (In-context learning)
Evaluation Metrics (NDCG)

Key Terms

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that gives more credit to correct items ranked higher in the list.

LightGCN: A graph convolutional network designed for recommendation that learns user/item embeddings from the interaction graph.

SASRec: Self-Attentive Sequential Recommendation—a model that uses attention mechanisms to capture sequential patterns in user actions.

Reranking: The process of re-ordering a preliminary list of recommendations to better align with specific user preferences.

Position-Based Feedback: A feedback mechanism that calculates the difference between an item's predicted rank and its target rank to generate corrective instructions.

User Profile Generation: The intermediate step of summarizing a user's raw interaction history into a structured textual profile (e.g., 'likes Sci-Fi').

CoT: Chain-of-Thought—a prompting technique where the model is encouraged to think step-by-step.

Batched Training: Updating the prompt based on feedback aggregated from a group of users rather than a single user, improving generalization.
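
Batched training as defined above can be sketched as one prompt update per group of users. This is an illustrative outline only: `train_epoch`, `collect_feedback`, `summarize`, and `update_prompt` are hypothetical names, and in a real system the last three would each wrap LLM calls.

```python
from typing import Callable, Iterator, List, Sequence


def batched(users: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size users."""
    for start in range(0, len(users), batch_size):
        yield list(users[start:start + batch_size])


def train_epoch(
    prompt: str,
    users: Sequence[str],
    batch_size: int,
    collect_feedback: Callable[[str, str], str],  # (prompt, user) -> feedback text
    summarize: Callable[[List[str]], str],        # per-user feedback -> batch summary
    update_prompt: Callable[[str, str], str],     # (prompt, summary) -> new prompt
) -> str:
    """Run one pass over the users, updating the prompt once per batch."""
    for batch in batched(users, batch_size):
        feedback = [collect_feedback(prompt, user) for user in batch]
        prompt = update_prompt(prompt, summarize(feedback))
    return prompt


# Toy callables that only trace the data flow; no model is involved.
result = train_epoch(
    "base",
    ["u1", "u2", "u3"],
    batch_size=2,
    collect_feedback=lambda prompt, user: f"fb({user})",
    summarize=lambda items: " + ".join(items),
    update_prompt=lambda prompt, summary: f"{prompt} | {summary}",
)
print(result)
```

Because each update sees a summary of several users' feedback, no single user's history can rewrite the prompt on its own.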