Reinforced Prompt Personalization for Recommendation with Large Language Models

📝 Paper Summary

LLM-based Recommendation Prompt Optimization

RPP optimizes prompts for individual users by employing multi-agent reinforcement learning to dynamically select sentence-level patterns for role-playing, history, reasoning, and formatting.

Core Problem

Most LLM recommenders use fixed 'task-wise' prompts shared across all users, which fails to capture individual dynamic preferences and sensitivities to prompt phrasing.

Why it matters:

Fixed templates (e.g., fixed history length) ignore that some users have short-term interests while others have long-term preferences
LLM performance is highly sensitive to prompt expression; a one-size-fits-all approach sacrifices potential performance gains from tailored wording
Manual prompt engineering is labor-intensive, while supervised learning lacks 'optimal prompt' labels for training

Concrete Example: A user whose science-fiction preference is visible only in movies watched two weeks ago needs a long history context, while a user whose comedy preference is visible in the last two films needs only a short one. A fixed task-wise prompt with a static history length cannot serve both.

Key Novelty

Reinforced Prompt Personalization (RPP/RPP+)

Frames prompt generation as a Multi-Agent Reinforcement Learning (MARL) problem where four agents collaboratively select sentences for different prompt patterns (Role, History, Reasoning, Format)
Reduces search space by optimizing at the sentence level rather than token level, ensuring generated prompts are coherent and grammatically correct
RPP+ introduces a dynamic 'refine' block where an LLM polishes the selected actions (sentences) during iterations to improve flexibility
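
The sentence-level decomposition can be illustrated with a minimal sketch. The candidate sentences below are hypothetical placeholders, not the pools used in the paper, and `compose_prompt` and `search_space_size` are illustrative helper names. Each of the four patterns contributes exactly one sentence, so the number of distinct prompts is the product of the pool sizes, and every composed prompt is built from whole, grammatical sentences.

```python
from math import prod

# Hypothetical candidate sentences per pattern (NOT the paper's actual pools).
ACTION_POOLS = {
    "role": [
        "You are a movie expert.",
        "You are a recommender system.",
    ],
    "history": [
        "The user recently watched: {recent}.",
        "The user's full watch history is: {full}.",
    ],
    "reasoning": [
        "Rank the candidates directly.",
        "First infer the user's preferences, then rank the candidates.",
    ],
    "format": [
        "Return a numbered list of titles.",
        "Return the titles separated by commas.",
    ],
}
PATTERN_ORDER = ("role", "history", "reasoning", "format")


def compose_prompt(actions, history, candidates, recent_k=2):
    """Join one selected sentence per pattern into a single prompt string."""
    slots = {
        "recent": ", ".join(history[-recent_k:]),  # short-term view
        "full": ", ".join(history),                # long-term view
    }
    sentences = [ACTION_POOLS[p][actions[p]].format(**slots) for p in PATTERN_ORDER]
    sentences.append("Candidates: " + ", ".join(candidates) + ".")
    return " ".join(sentences)


def search_space_size(pools=ACTION_POOLS):
    """Number of distinct prompts: the product of the per-pattern pool sizes."""
    return prod(len(sentences) for sentences in pools.values())


history = ["Alien", "Blade Runner", "Airplane!", "The Naked Gun"]
short_term = {"role": 0, "history": 0, "reasoning": 0, "format": 0}
long_term = {"role": 0, "history": 1, "reasoning": 1, "format": 0}
short_prompt = compose_prompt(short_term, history, ["Top Secret!", "Dune"])
long_prompt = compose_prompt(long_term, history, ["Top Secret!", "Dune"])
print(short_prompt)
print(long_prompt)
```

With two candidates per pattern, this toy space holds 2 × 2 × 2 × 2 = 16 prompts, whereas a token-level search would have to choose from the whole vocabulary at every position.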

Architecture

The RPP framework overview illustrating the Multi-Agent RL process.

Breakthrough Assessment

7/10

Novel application of MARL to personalize prompts per user instance (not just per task). Addresses the search space issue of RL-based prompting effectively.

⚙️ Technical Details

Problem Definition

Setting: Instance-wise prompt optimization for Item Ranking

Inputs: User interaction history H and candidate items C

Outputs: Optimal prompt p* that maximizes alignment between LLM ranking results and ground truth

Pipeline Flow

State Encoder (User Embeddings + Previous Prompt/Result)
Multi-Agent Action Selection (Role, History, Reasoning, Format)
Refiner LLM (RPP+ only)
Recommender LLM (Environment)
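
The control flow of the pipeline above can be sketched as a loop. This is only a skeleton under stated assumptions: the policy is a uniform-random placeholder (no learning happens here), `refine` and `toy_environment` are hypothetical stand-ins for the refiner LLM and the recommender LLM plus ranking metric, and the state is a plain dictionary rather than a learned encoding.

```python
import random

PATTERNS = ("role", "history", "reasoning", "format")


def select_actions(state, n_actions, rng):
    """Placeholder policy: each agent picks one sentence index uniformly at random."""
    return {pattern: rng.randrange(n_actions) for pattern in PATTERNS}


def refine(actions):
    """RPP+ only. Stand-in for the refiner LLM; here it returns the actions unchanged."""
    return dict(actions)


def toy_environment(actions, n_actions):
    """Stand-in for the recommender LLM and its ranking score; returns a value in [0, 1]."""
    max_total = len(PATTERNS) * max(n_actions - 1, 1)
    return sum(actions.values()) / max_total


def run_episode(n_steps=3, n_actions=3, use_refiner=False, seed=0):
    rng = random.Random(seed)
    state = {"user_id": 42, "prev_actions": None, "prev_reward": 0.0}
    trajectory = []
    for _ in range(n_steps):
        actions = select_actions(state, n_actions, rng)
        if use_refiner:
            actions = refine(actions)
        reward = toy_environment(actions, n_actions)
        trajectory.append((actions, reward))
        # The next state carries the previous prompt choice and its result.
        state = {**state, "prev_actions": actions, "prev_reward": reward}
    return trajectory


for step_actions, step_reward in run_episode(use_refiner=True):
    print(step_actions, round(step_reward, 3))
```

In a real system the random policy would be replaced by the trained agents and the toy environment by a call to the recommender LLM followed by scoring against the ground truth.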

System Modules

State Encoder

Encode user features and current environment state into a shared representation

Model or implementation: BERT (for text) + GRU (for sequences) + LightGCN (for user embeddings)

MARL Agents (x4)

Select specific sentences for the four prompt patterns

Model or implementation: Actor-Critic Networks (A2C)

Refiner

Dynamically polish selected sentences to be more effective

Model or implementation: LLM (Unspecified, likely same as recommender)

Recommender

Generate item rankings based on the personalized prompt

Model or implementation: LLM-based Recommender

Novel Architectural Elements

Decomposition of prompt optimization into 4 discrete sub-agents (Role, History, Reasoning, Format) trained via MARL
Hybrid state representation combining static user embeddings (LightGCN) with dynamic interaction embeddings (BERT/GRU)

Modeling

Base Model: LLM (Specific model not detailed in text, generally applicable)

Training Method: Multi-Agent Reinforcement Learning (MARL) with A2C under CTDE

Objective Functions:

Purpose: Maximize cumulative reward (ranking performance) for the Actor.

Formally: L_a = -log(prob) * (R - V), where prob is the probability the policy assigns to the selected action, R is the return, and V is the Critic's value estimate

Purpose: Minimize prediction error for the Critic (Value estimation).

Formally: L_c = (R - V)^2

Purpose: Reward signal based on ranking quality.

Formally: r_t = NDCG@M evaluated on LLM output
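
As a worked illustration of the actor and critic losses above, the following minimal sketch evaluates both for a single agent at a single step. The function name `a2c_losses` and the use of raw logits with a softmax policy are assumptions made for the example; the summary does not specify how the policy distribution is parameterized.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_losses(logits, action, ret, value):
    """Return (actor_loss, critic_loss) for one agent at one step.

    actor_loss  = -log(prob of chosen action) * (R - V)
    critic_loss = (R - V) ** 2
    """
    log_prob = math.log(softmax(logits)[action])
    # In typical implementations the advantage is treated as a constant
    # when differentiating the actor loss.
    advantage = ret - value
    return -log_prob * advantage, advantage ** 2


# Two equally likely sentences, return R = 1.0, value estimate V = 0.5.
actor_loss, critic_loss = a2c_losses([0.0, 0.0], action=0, ret=1.0, value=0.5)
print(actor_loss, critic_loss)
```

Here the chosen action has probability 0.5, so the actor loss is 0.5 × ln 2 ≈ 0.3466 and the critic loss is 0.25. A positive advantage means minimizing the actor loss raises the probability of the chosen sentence.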

Key Hyperparameters:

reward_metric: NDCG@10
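
For reference, NDCG@M can be computed as in the sketch below. It assumes binary relevance (an item is either a ground-truth item or not) and a ranked list without duplicates; whether the paper uses exactly this variant is not stated in the summary, and `ndcg_at_m` is an illustrative name.

```python
import math


def ndcg_at_m(ranked_items, relevant_items, m=10):
    """NDCG@M with binary relevance.

    Assumes ranked_items contains no duplicates; positions are 0-based,
    so the discount for position i is 1 / log2(i + 2).
    """
    relevant = set(relevant_items)
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked_items[:m])
        if item in relevant
    )
    ideal_hits = min(len(relevant), m)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg_at_m(["a", "b", "c"], ["a"]))  # ground-truth item ranked first
print(ndcg_at_m(["b", "a", "c"], ["a"]))  # ground-truth item ranked second
```

With a single ground-truth item, the score is 1.0 when it is ranked first and 1 / log2(3) ≈ 0.631 when it is ranked second, so the reward falls off smoothly as the target item moves down the LLM's ranking.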

Compute: Not reported in the paper

Comparison to Prior Work

vs. Task-wise: RPP generates unique prompts per user instance
vs. Token-level RL (e.g., RLPrompt): RPP optimizes at sentence level to ensure fluency and reduce search space
vs. Manual: RPP automates the search process using RL

Limitations

Search space is constrained to the pre-defined pool of sentences (actions) in RPP (though RPP+ mitigates this via refinement)
Requires an iterative interaction process with the LLM which increases computational cost compared to fixed prompting

Reproducibility

Code: https://github.com/maowenyu-11/RPP

Action space candidates for 'Reasoning' and 'Role-playing' are explicitly listed in the paper.

📊 Experiments & Results

Evaluation Setup

Item Ranking Task

Benchmarks:

MovieLens-1M (Movie Recommendation)
Games (Game Recommendation, Amazon)
Lastfm (Music Recommendation)

Metrics:

NDCG@10
HR@10 (implied by typical ranking setups, though the text specifies NDCG as the reward)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

RPP and RPP+ significantly improve recommendation performance over traditional models (e.g., LightGCN), few-shot methods (e.g., VQ-Rec), and fixed-prompt baselines.
Sentence-level optimization balances search efficiency with prompt quality compared to token-level search.
The 'Refine' block in RPP+ enhances the scalability of the action space by dynamically refining the selected actions, so prompts are no longer limited to the pre-defined sentence pool.
Personalizing history length (History pattern) effectively captures differences between short-term and long-term user preferences.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) basics (State, Action, Reward)
Large Language Models (LLMs) for Recommendation
Prompt Engineering patterns (CoT, Role-playing)

Key Terms

RPP: Reinforced Prompt Personalization—the proposed framework using RL to select prompt sentences

MARL: Multi-Agent Reinforcement Learning—multiple agents learning policies simultaneously, here used to handle different prompt parts

CTDE: Centralized Training with Decentralized Execution—a paradigm where agents train with global information but act independently

A2C: Advantage Actor-Critic—an RL algorithm combining policy-based and value-based methods

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality

CoT: Chain-of-Thought—a prompting technique encouraging LLMs to show reasoning steps

Task-wise prompting: Using a single fixed prompt template for all users in a specific task

Instance-wise prompting: Generating a unique, personalized prompt for each specific user/inference instance