RecPrompt: A Self-tuning Prompting Framework for News Recommendation Using Large Language Models

📝 Paper Summary

LLM-based News Recommendation · Automated Prompt Engineering

RecPrompt improves news recommendation by using an iterative two-LLM loop where a prompt optimizer refines the recommender's instructions based on feedback, alongside a new metric for evaluating topic explanations.

Core Problem

Existing LLM-based news recommenders often fail to surpass deep neural baselines and require labor-intensive manual prompt engineering that may not align with user interests.

Why it matters:

Fine-tuning LLMs is resource-intensive and requires high-quality rationales which are hard to produce
Current evaluation metrics focus purely on ranking and lack ground truth for assessing the quality and explainability of generated topic summaries
Manual prompt design is static and may not optimally leverage the LLM's reasoning capabilities for specific recommendation tasks

Concrete Example: A user interested in 'Sports' clicks news H1 and H3. A standard recommender might correctly rank news but fail to explain why. RecPrompt's optimizer detects if the 'topics' in the explanation are unclear or misaligned with click behavior and rewrites the prompt to force the recommender to better summarize user-interest topics.

Key Novelty

Self-tuning Prompting Loop (RecPrompt)

Iterative bootstrapping process involving two LLMs: one acts as the Recommender making predictions, and the other as an Optimizer refining the Recommender's prompt templates
Introduction of a Monitor component that tracks performance metrics (MRR, nDCG) and accepts or rejects each prompt update, so the retained template's validation score never decreases across iterations
Development of TopicScore, a novel metric to evaluate the explainability of recommendations by measuring the correctness and completeness of summarized interest topics

Architecture

The RecPrompt framework workflow.

Evaluation Highlights

+3.36% relative improvement in AUC (from 0.6865 to 0.7096) and +10.49% in MRR (from 0.3164 to 0.3496) compared to traditional deep neural models on the MIND dataset
CoT-LLM_rec-4 (GPT-4 with Chain-of-Thought) surpasses all deep neural baselines using zero-shot prompting without training on recommendation data
RecPrompt outperforms standard prompting strategies (Input-Output and Chain-of-Thought) by iteratively refining the instruction template

Breakthrough Assessment

7/10

Strong improvements over deep learning baselines using a novel self-tuning loop. The introduction of an explainability metric (TopicScore) fills a gap, though the method relies on closed-source models (GPT-3.5/4).

⚙️ Technical Details

Problem Definition

Setting: News recommendation prediction and explanation generation

Inputs: User history H_u = {nw_i} (titles/categories) and Candidate news D_u = {(nw_j, y_j)}

Outputs: Ranked list of candidate news R_u' and explanations summarizing topics TP_u
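
The inputs and outputs above map naturally onto a few record types. The following is a minimal sketch, not the authors' code: the class names, field names, and the `C1`-style candidate labels used by `render_user` are illustrative assumptions (the summary itself only shows history items referred to as H1, H3).

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class News:
    title: str
    category: str = ""  # the summary lists titles/categories as inputs


@dataclass
class UserSample:
    history: list[News]      # H_u: previously clicked news
    candidates: list[News]   # nw_j: candidate news to rank
    labels: list[int]        # y_j: 1 if clicked, 0 otherwise (ground truth only)


@dataclass
class Recommendation:
    ranking: list[str]       # R_u': candidate labels, best first (e.g. ["C2", "C1"])
    topics: list[str]        # TP_u: summarized user-interest topics


def render_user(sample: UserSample) -> str:
    """Format one user as plain text for a prompt; labels y_j are never shown."""
    lines = ["History:"]
    for i, nw in enumerate(sample.history, start=1):
        lines.append(f"H{i}: [{nw.category}] {nw.title}")
    lines.append("Candidates:")
    for j, nw in enumerate(sample.candidates, start=1):
        lines.append(f"C{j}: [{nw.category}] {nw.title}")
    return "\n".join(lines)


# Hypothetical example user (invented titles, for illustration only).
sample = UserSample(
    history=[News("Local team wins the cup final", "sports")],
    candidates=[News("Stock markets dip", "finance"),
                News("Star striker injured", "sports")],
    labels=[0, 1],
)
text = render_user(sample)
print(text)
```

Keeping the click labels out of the rendered text matters: they are only used afterwards, when the recommender's ranking is scored.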

Pipeline Flow

News Recommender (Generates rank + explanation)
Monitor (Evaluates performance & records best template)
Prompt Optimizer (Refines template based on performance)
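
The three stages above can be sketched as a simple accept-or-reject loop. This is a hedged illustration under stated assumptions, not the paper's implementation: `self_tune` and its three callables are hypothetical names, the real recommender and optimizer would be LLM API calls, and whether the optimizer is fed the best or the most recent template (and what feedback it sees) is an assumption here. The toy stand-ins exist only so the sketch runs without any API.

```python
from typing import Callable, List, Tuple


def self_tune(
    initial_template: str,
    recommend: Callable[[str], object],       # stand-in for LLM_rec
    monitor_score: Callable[[object], float],  # stand-in for the Monitor's metric
    optimize: Callable[[str, float], str],     # stand-in for LLM_opt
    iterations: int = 10,
) -> Tuple[str, float, List[float]]:
    """Keep a candidate template only if it beats the best score so far."""
    best_template = initial_template
    best_score = monitor_score(recommend(best_template))
    trace = [best_score]  # best-so-far score after each step
    for _ in range(iterations):
        candidate = optimize(best_template, best_score)
        score = monitor_score(recommend(candidate))
        if score > best_score:  # Monitor accepts the update
            best_template, best_score = candidate, score
        trace.append(best_score)
    return best_template, best_score, trace


# Toy stand-ins (assumptions): the "recommender" counts the word "topic"
# in its template, and the "metric" saturates after three mentions.
def toy_recommend(template: str) -> int:
    return template.count("topic")


def toy_score(outputs: int) -> float:
    return min(outputs, 3) / 3


def toy_optimize(template: str, score: float) -> str:
    return template + " Summarize each interest topic."


best, best_score, trace = self_tune(
    "Rank the candidate news for this user.",
    toy_recommend, toy_score, toy_optimize,
)
print(best_score, trace)
```

Because rejected candidates are discarded, the best-so-far trace can only stay flat or rise, which is the sense in which the Monitor prevents regressions on the validation users.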

System Modules

News Recommender (LLM_rec)

Generates ranked news list and topic explanations based on current prompt template

Model or implementation: GPT-3.5 (gpt-3.5-turbo-1106) or GPT-4 (gpt-4-1106-preview)

Monitor

Evaluates current recommendations against ground truth to decide if the new prompt template should be kept

Model or implementation: Algorithmic component (Metric calculation)

Prompt Optimizer (LLM_opt)

Generates a refined template instruction to improve the recommender's future performance

Model or implementation: GPT-3.5 or GPT-4

Novel Architectural Elements

Closed-loop iterative bootstrapping where the optimizer LLM modifies the recommender LLM's system prompt based on validation set performance metrics monitored by a separate module

Modeling

Base Model: GPT-3.5 (gpt-3.5-turbo-1106) and GPT-4 (gpt-4-1106-preview)

Training Method: In-context learning / Prompt Optimization (No weight updates)

Adaptation: Prompt Engineering (Self-tuning)

Trainable Parameters: None (Frozen LLMs)

Key Hyperparameters:

iterations (l): 10
shots: Zero-shot

Compute: Not reported in the paper

Comparison to Prior Work

vs. Deep Neural Models (LSTUR, NAML): RecPrompt uses frozen LLMs with dynamic prompts rather than training dense representations
vs. Static Prompting (IO, CoT): RecPrompt iteratively optimizes the prompt wording based on feedback rather than using fixed templates
vs. OPRO [not cited in paper]: Similar to OPRO (Optimization by PROmpting) but specifically tailored for news recommendation with topic-aware observation instructions

Limitations

Reliance on commercial, closed-source APIs (GPT-3.5/4) with associated costs
Inference latency is high due to LLM calls compared to lightweight neural rankers
Experiments limited to a small subset of users (400 test users) from the MIND dataset
Maximum context length of LLMs may limit the number of history/candidate items processed

Reproducibility

Code: https://github.com/Ruixinhua/rec-prompt

Code is publicly available at the repository above. The experiments use the OpenAI API (GPT-3.5/4), so exact reproduction depends on continued availability of the listed model versions. Prompt templates are described in the paper.

📊 Experiments & Results

Evaluation Setup

News recommendation on MIND dataset. Validation set: 100 users for prompt optimization. Test set: 400 users.

Benchmarks:

MIND (MIcrosoft News Dataset) (News Recommendation)

Metrics:

AUC
MRR
nDCG@5
nDCG@10
TopicScore (Correctness, Completeness)

Statistical methodology: Experiments conducted 3 times, average performance reported. No significance tests explicitly reported.
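
For reference, the ranking metrics listed above can be computed for a single user as in the sketch below. It uses the textbook definitions with binary click labels; the function names are illustrative. One caveat: `reciprocal_rank` scores only the first clicked item, whereas some recommendation evaluation scripts average reciprocal ranks over all clicked items, and this summary does not state which variant the paper uses. Dataset-level numbers would be the mean of these per-user values.

```python
import math
from typing import List, Sequence


def _labels_by_score(labels: Sequence[int], scores: Sequence[float]) -> List[int]:
    """Return labels reordered from highest to lowest predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def reciprocal_rank(labels: Sequence[int], scores: Sequence[float]) -> float:
    """1 / rank of the first clicked item, or 0.0 if nothing was clicked."""
    for rank, label in enumerate(_labels_by_score(labels, scores), start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(labels: Sequence[int], scores: Sequence[float], k: int) -> float:
    """DCG of the predicted top-k divided by the DCG of an ideal top-k."""
    def dcg(ranked: Sequence[int]) -> float:
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ranked))

    ideal = dcg(sorted(labels, reverse=True)[:k])
    if ideal == 0:
        return 0.0
    return dcg(_labels_by_score(labels, scores)[:k]) / ideal


# Hypothetical impression: four candidates, the 2nd and 4th were clicked.
labels = [0, 1, 0, 1]
scores = [0.9, 0.8, 0.3, 0.2]
print(auc(labels, scores), reciprocal_rank(labels, scores), ndcg_at_k(labels, scores, 5))
```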

Key Results

Comparison against deep neural baselines (LSTUR, NAML, NRMS) and traditional baselines (TopicPop). RecPrompt with GPT-4 (CoT-LLM_rec-4) achieves the highest performance.

Benchmark	Metric	Baseline	This Paper	Δ
MIND	AUC	0.6865	0.7096	+0.0231
MIND	MRR	0.3164	0.3496	+0.0332
MIND	nDCG@10	0.3804	0.4040	+0.0236

Ablation study on the effect of the optimizer LLM version and category information.

Benchmark	Metric	Baseline	This Paper	Δ
MIND	AUC	0.6273	0.6558	+0.0285
MIND	AUC	0.6698	0.6973	+0.0275
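
The headline percentages quoted in this summary (+3.36% AUC, +10.49% MRR) are consistent with relative gains computed from the absolute values in the first three result rows above. The short check below reproduces that arithmetic; the helper name `relative_gain` is illustrative.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (baseline, this paper) pairs copied from the main comparison rows.
rows = {
    "AUC": (0.6865, 0.7096),
    "MRR": (0.3164, 0.3496),
    "nDCG@10": (0.3804, 0.4040),
}

for metric, (base, new) in rows.items():
    print(f"{metric}: absolute {new - base:+.4f}, relative {relative_gain(base, new):+.2f}%")
```

The Δ column in the table is therefore an absolute difference, while the Evaluation Highlights quote the same gains in relative terms.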

Experiment Figures

Comparison of TopicScore (Correctness and Completeness) evaluated by Human, LLM_eval-3.5, and LLM_eval-4.

Main Takeaways

RecPrompt consistently improves recommendation performance over static prompting (IO and CoT) by iteratively refining the prompt.
GPT-4 as a recommender (LLM_rec-4) substantially outperforms GPT-3.5 and traditional deep neural models in zero-shot settings, although no significance tests are reported.
Including news category information in the prompt is crucial for performance, aiding the model in matching user interests.
TopicScore evaluation shows that LLM-generated explanations are highly rated by both human annotators and LLM judges for correctness and completeness.

📚 Prerequisite Knowledge

Prerequisites

News Recommendation Systems
Large Language Models (LLMs)
Prompt Engineering (Zero-shot, CoT)
Information Retrieval Metrics (AUC, MRR, nDCG)

Key Terms

RecPrompt: The proposed self-tuning prompting framework involving a Recommender, Optimizer, and Monitor

TopicScore: A proposed metric to evaluate explainability by measuring the correctness and completeness of summarized topics against news content and user history

MIND: MIcrosoft News Dataset—a large-scale benchmark dataset for news recommendation

AUC: Area Under the ROC Curve, the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative one

MRR: Mean Reciprocal Rank, the average over users of 1/rank of the first correct recommendation

nDCG: normalized Discounted Cumulative Gain—a metric evaluating the quality of ranking, giving more weight to top-ranked items

CoT: Chain of Thought—a prompting strategy that encourages the model to generate intermediate reasoning steps

IO prompting: Input-Output prompting—a simple strategy asking for direct textual responses without intermediate reasoning

Zero-shot prompting: Asking the model to perform a task without including any worked examples (demonstrations) in the prompt