LLM-Enhanced Reinforcement Learning for Long-Term User Satisfaction in Interactive Recommendation

📝 Paper Summary

Interactive Recommender Systems (IRS), Long-term user satisfaction, Hierarchical Reinforcement Learning

LERL combines a high-level LLM planner that ensures semantic diversity with a low-level RL agent that optimizes fine-grained item ranking, preventing filter bubbles and improving long-term satisfaction.

Core Problem

Interactive recommender systems often overfit short-term feedback, leading to filter bubbles and content homogeneity that degrade long-term user satisfaction.

Why it matters:

Users trapped in filter bubbles experience cognitive fatigue and reduced novelty, which eventually drives them off the platform.
Existing RL methods struggle with sparse, long-tail data and lack semantic planning capabilities, while LLMs struggle to ground abstract plans into fine-grained item actions.
Current diversity-enhancing methods (like re-ranking) typically operate in static or one-shot settings, failing to optimize for dynamic, long-term preference evolution.

Concrete Example: A user watches several sci-fi movies. A standard RL agent might recommend *only* sci-fi movies to maximize immediate clicks, eventually boring the user. LERL's planner would intervene to inject a different category (e.g., 'documentary') based on semantic reflection, while the low-level agent selects the best specific documentary for that user.

Key Novelty

LLM-Enhanced Reinforcement Learning (LERL)

Hierarchical decomposition: Uses an LLM as a 'manager' to select broad content categories (semantic planning) and a traditional RL agent as a 'worker' to pick specific items within those categories.
Reflective Critic: Instead of just a scalar reward, the high-level critic generates textual 'reflections' on past user sessions to guide the LLM planner toward better long-term strategies.
Constrained Action Space: The LLM narrows the search space for the RL agent, enforcing diversity constraints that the RL agent might otherwise learn too slowly or not at all.

Architecture

The overall architecture of LERL, illustrating the interaction between the High-Level Semantic Planner (LLM) and the Low-Level Policy Learner (RL).

Evaluation Highlights

Outperforms state-of-the-art baselines (including recent RL and LLM methods) in long-term cumulative reward on KuaiRec and Kwaishou datasets.
Significantly reduces filter bubble effects, maintaining higher category diversity over long interaction trajectories compared to standard RL approaches.
Ablation studies confirm that removing the high-level LLM planner leads to a sharp drop in performance, validating the necessity of semantic guidance.

Breakthrough Assessment

7/10

Strong conceptual combination of LLM planning and RL execution for a critical problem (filter bubbles). While the architecture is novel, the evaluation relies on simulated environments due to the difficulty of live testing.

⚙️ Technical Details

Problem Definition

Setting: Interactive Recommendation modeled as a Markov Decision Process (MDP) over a sequence of time steps t

Inputs: User interaction history H_t (sequence of past items and feedback)

Outputs: Recommendation list a_t containing a subset of items

Pipeline Flow

High-Level Semantic Planner (LLM) selects category constraints
Low-Level Policy Learner (RL) generates item recommendations within constraints
Environment returns feedback; High-Level Critic generates textual reflections
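
The three pipeline steps above can be sketched as a control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `plan_categories`, `rank_items`, `simulate_feedback`, and `reflect` are hypothetical stand-ins for the LLM planner, the PPO-trained policy, the simulated environment, and the LLM critic, and the toy catalog and reward rule are invented.

```python
import random

# Toy catalog: item id -> category (invented for illustration).
CATALOG = {
    "i1": "sci-fi", "i2": "sci-fi", "i3": "documentary",
    "i4": "documentary", "i5": "comedy", "i6": "comedy",
}

def plan_categories(history, reflections, k=2):
    """Stand-in for the LLM planner: prefer the least-shown categories.

    A real planner would also condition on `reflections`; this stub ignores them.
    """
    counts = {c: 0 for c in set(CATALOG.values())}
    for item in history:
        counts[CATALOG[item]] += 1
    # Sort by exposure count, then by name, so the result is deterministic.
    return sorted(counts, key=lambda c: (counts[c], c))[:k]

def rank_items(history, allowed_categories, rng):
    """Stand-in for the RL worker: pick an unseen item inside the allowed categories."""
    candidates = [i for i, c in CATALOG.items()
                  if c in allowed_categories and i not in history]
    return rng.choice(sorted(candidates)) if candidates else None

def simulate_feedback(item, history):
    """Stand-in for the simulator: reward drops when a category repeats."""
    repeats = sum(1 for h in history[-2:] if CATALOG[h] == CATALOG[item])
    return 1.0 - 0.5 * repeats

def reflect(history, total_reward):
    """Stand-in for the LLM critic: return a textual note about the session."""
    shown = sorted({CATALOG[i] for i in history})
    return f"Session covered {shown} with total reward {total_reward:.1f}."

def run_session(steps=4, seed=0):
    rng = random.Random(seed)
    history, reflections, total = [], [], 0.0
    for _ in range(steps):
        allowed = plan_categories(history, reflections)  # high-level step
        item = rank_items(history, allowed, rng)         # low-level step
        if item is None:
            break
        total += simulate_feedback(item, history)        # environment step
        history.append(item)
    reflections.append(reflect(history, total))          # critic step
    return history, total, reflections
```

The point of the sketch is the division of labor: the planner only ever emits categories, and the worker only ever chooses among items that fall inside them.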

System Modules

High-Level Actor (Semantic Planning)

Selects a subset of content categories to recommend based on history and reflections

Model or implementation: LLM (the specific model is not named in the extracted text; presumably a standard instruction-tuned LLM)

High-Level Critic (Semantic Planning)

Generates textual reflections to critique and improve the high-level planner

Model or implementation: LLM (Prompt-based)

Low-Level Actor (Policy Learning)

Selects specific items to recommend given the category constraints

Model or implementation: Transformer-based Encoder + MLP Head
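
One way to picture how category constraints reach the low-level actor is a masked softmax over item scores. The sketch below is an assumption about the mechanism, not the paper's code: it uses a hard mask that zeroes out items outside the selected categories, and the function name `masked_softmax`, the scores, and the categories are all hypothetical.

```python
import math

def masked_softmax(scores, item_categories, allowed):
    """Softmax over item scores; items outside `allowed` get probability 0."""
    keep = [c in allowed for c in item_categories]
    if not any(keep):
        raise ValueError("mask removes every item")
    # Subtract the largest kept score for numerical stability.
    top = max(s for s, k in zip(scores, keep) if k)
    exps = [math.exp(s - top) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores from the policy head for four items.
scores = [2.0, 1.5, 0.3, 0.1]
cats = ["sci-fi", "sci-fi", "documentary", "documentary"]
probs = masked_softmax(scores, cats, allowed={"documentary"})
```

Even though the two sci-fi items have the highest raw scores, all probability mass moves to the documentaries once the planner selects that category. A softer variant would subtract a penalty from disallowed scores instead of zeroing them.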

Low-Level Critic (Policy Learning)

Estimates the state-value function to guide PPO training

Model or implementation: MLP (Multi-Layer Perceptron)
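
As a rough illustration of what the low-level critic feeds into PPO, the sketch below computes one-step temporal-difference targets, advantages, and the squared-error value loss. The function names are hypothetical, the discount factor is a placeholder (the summary does not report it), and the paper may use a different advantage estimator.

```python
def td_targets_and_advantages(rewards, values, gamma=0.99):
    """One-step TD targets r_t + gamma * V(s_{t+1}) and advantages target - V(s_t).

    `values` holds V(s_0)..V(s_T), one more entry than `rewards`;
    the last entry is the bootstrap value (0.0 for a terminal state).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    targets = [r + gamma * values[t + 1] for t, r in enumerate(rewards)]
    advantages = [tgt - values[t] for t, tgt in enumerate(targets)]
    return targets, advantages

def value_loss(targets, predictions):
    """Mean squared error between value predictions and TD targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical two-step episode ending in a terminal state.
rewards = [1.0, 0.0]
values = [0.5, 0.4, 0.0]
targets, advantages = td_targets_and_advantages(rewards, values, gamma=0.9)
loss = value_loss(targets, values[:-1])
```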

Novel Architectural Elements

Hierarchical integration where LLM output defines the valid action space (category mask) for the RL agent
Reflection-driven prompting mechanism: Using past session summaries (reflections) as in-context examples to improve the planner's long-term strategy
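
The reflection-driven prompting idea can be illustrated with a small prompt builder. This is a sketch only: the prompt wording, the function `build_planner_prompt`, and the reflection texts are invented, and `n_samples` merely plays the role of the reflection sample size, whose value the summary does not report.

```python
import random

def build_planner_prompt(recent_categories, reflection_pool, all_categories,
                         n_samples=2, seed=0):
    """Assemble a planner prompt from recent history and sampled reflections."""
    rng = random.Random(seed)
    k = min(n_samples, len(reflection_pool))
    sampled = rng.sample(reflection_pool, k)  # reflections used as in-context examples
    lines = ["You choose content categories for the next recommendations.",
             "Lessons from earlier sessions:"]
    if sampled:
        lines.extend(f"- {text}" for text in sampled)
    else:
        lines.append("- (none yet)")
    lines.append("Recently shown categories: " + ", ".join(recent_categories))
    lines.append("Available categories: " + ", ".join(all_categories))
    lines.append("Reply with the categories to recommend next.")
    return "\n".join(lines)

# Invented reflections standing in for the critic's output.
pool = [
    "Repeating sci-fi five times in a row lowered engagement.",
    "Introducing a documentary after sci-fi kept the user watching.",
    "Comedy late in the session was skipped.",
]
prompt = build_planner_prompt(["sci-fi", "sci-fi"], pool,
                              ["sci-fi", "documentary", "comedy"])
```

The resulting string would be sent to whichever LLM backs the planner; no model call is made here.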

Modeling

Base Model: Pretrained LLM (for High-Level Planner) + Transformer (for Low-Level RL Agent)

Training Method: Hierarchical RL (LLM prompting + PPO for low-level)

Objective Functions:

Purpose: Optimize low-level policy to maximize long-term reward.

Formally: PPO clipped surrogate objective L^CLIP

Purpose: Minimize error in value estimation.

Formally: Squared error between estimated value and temporal difference target
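
For reference, the two objectives named above are usually written as follows. This is the standard textbook form of PPO, not notation confirmed from the paper; the symbols \rho_t, \hat{A}_t, \epsilon, \gamma, and V_\phi are assumptions of this sketch, and a one-step temporal-difference target is assumed for the value loss.

```latex
% Probability ratio between the current and the previous policy
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

% PPO clipped surrogate objective (maximized with respect to \theta)
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[
  \min\left( \rho_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right)
\right]

% Value loss: squared error against the temporal-difference target
L^{V}(\phi) = \mathbb{E}_t\left[
  \left( V_\phi(s_t) - \left( r_t + \gamma\, V_\phi(s_{t+1}) \right) \right)^2
\right]
```

Clipping the ratio to [1-\epsilon, 1+\epsilon] removes the incentive to move the policy far from the one that collected the data within a single update.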

Key Hyperparameters:

discount_factor_gamma: Not explicitly reported in the paper text provided
PPO_clip_epsilon: Not explicitly reported in the paper text provided
reflection_sample_size_Ns: Not explicitly reported in the paper text provided

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. KCRL/HER4IF: LERL uses a hierarchical LLM planner for semantic diversity rather than just graph/fairness constraints.
vs. CIRS/DNaIR: LERL explicitly plans categories via LLM reasoning rather than relying solely on intrinsic rewards or penalty terms in the scalar reward function.
vs. Static Re-ranking (TD-VAE-CF, DOR): LERL optimizes sequentially for long-term satisfaction rather than one-shot diversity.

Limitations

Relies on a simulated offline environment, which may not perfectly capture real human behavioral shifts.
The high-level planner's inference cost (LLM calls) is significantly higher than pure RL approaches, potentially affecting real-time latency.
Requires mapping items to clear semantic categories; performance depends on the quality of this taxonomy.

Reproducibility

Code: https://github.com/1163710212/LERL

The paper uses a simulated environment built from logged datasets (KuaiRec, Kwaishou) to avoid the cost of collecting online user feedback. Hyperparameters such as learning rates and batch sizes are not detailed in the provided text.

📊 Experiments & Results

Evaluation Setup

Simulated offline interactive recommendation

Benchmarks:

KuaiRec (Sequential Recommendation / User Simulation)
Kwaishou (Sequential Recommendation / User Simulation)

Metrics:

Long-term User Satisfaction (Cumulative Reward)
Content Diversity
Recommendation Accuracy

Statistical methodology: Not explicitly reported in the paper
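
The metrics listed above are not defined formally in the provided text. The sketch below shows common ways such quantities are computed, as an assumption rather than the paper's definitions: cumulative reward as a plain sum over a session, and diversity as either category coverage or the Shannon entropy of the recommended categories. All function names are hypothetical.

```python
import math
from collections import Counter

def cumulative_reward(rewards):
    """Undiscounted sum of per-step rewards over one session."""
    return sum(rewards)

def category_coverage(categories, n_total):
    """Fraction of all categories that appear at least once in the session."""
    return len(set(categories)) / n_total

def category_entropy(categories):
    """Shannon entropy (in nats) of the category distribution in the session."""
    counts = Counter(categories)
    n = len(categories)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical session: a filter-bubble trajectory versus a mixed one.
bubble = ["sci-fi"] * 4
mixed = ["sci-fi", "documentary", "sci-fi", "documentary"]
```

Under these definitions a session that repeats one category has zero entropy, while an even split across two categories reaches log 2.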

Key Results

The paper claims that LERL significantly improves long-term user satisfaction and outperforms state-of-the-art baselines. The provided text contains no numeric result tables, so no per-benchmark baseline, LERL, or Δ values can be reported here; the abstract and introduction make only qualitative claims of superiority.

Experiment Figures

Example of the prompt structure used for the High-Level Actor (Action Selection).

Example of the prompt structure used for the High-Level Critic (Reflection Generation).

Main Takeaways

LERL effectively mitigates filter bubbles by proactively introducing semantically diverse categories via the LLM planner.
The hierarchical structure allows the system to balance exploration (via high-level category selection) and exploitation (via low-level item optimization).
Textual reflections provide a more semantic and actionable signal for policy improvement than scalar rewards alone.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDP, Policy Gradient, Actor-Critic)
Recommender Systems (Matrix Factorization, Embeddings)
Large Language Models (Prompting, In-context learning)

Key Terms

IRS: Interactive Recommender Systems—systems that adapt recommendations in real-time based on user feedback

Filter Bubble: A state of intellectual isolation where a user is exposed only to content that aligns with their existing preferences, excluding diverse viewpoints

PPO: Proximal Policy Optimization—a reinforcement learning algorithm used here to train the low-level policy learner

MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker

Semantic Planning: High-level decision making focused on broad categories or topics rather than specific items

Reflection Pool: A memory bank storing textual critiques/summaries of past user sessions, used to prompt the LLM for better future planning

Gaussian distribution: A continuous probability distribution used here to sample virtual item embeddings for exploration

Transformer: A neural network architecture using self-attention, used here to encode user interaction history

Soft filter: A mechanism that applies the planner's category selection to the item space, either softly (prioritizing items from the selected categories without strictly forbidding others) or strictly (enforcing the category mask)