Large Language Models are Learnable Planners for Long-Term Recommendation

📝 Paper Summary

LLM-based Recommendation
Interactive Recommendation

BiLLP improves long-term user engagement in recommendation by using LLMs for hierarchical planning, splitting learning into macro-level reflection on principles and micro-level personalized action adjustments.

Core Problem

Traditional RL-based recommenders struggle with data sparsity and instability when learning long-term planning from scratch, while standard LLM recommenders lack mechanisms to learn from interaction feedback for long-term engagement.

Why it matters:

Greedy recommendation strategies (optimizing clicks) trap users in filter bubbles and echo chambers, hurting long-term retention
Existing RL methods require massive data to learn stable policies, failing on sparse or long-tail items
LLMs applied directly lack personalization and task-specific 'commonsense' about what sustains long-term engagement

Concrete Example: A greedy model might repeatedly recommend similar action games to a user, maximizing immediate clicks but causing boredom and eventual churn (filter bubble). BiLLP's planner detects this pattern, reflects that 'repetitive items cause withdrawal', and plans to diversify genres, extending the interaction.

Key Novelty

Bi-level Learnable LLM Planning (BiLLP)

Decomposes recommendation into two loops: Macro-learning (Planner + Reflector) learns high-level principles like 'diversity keeps users engaged', while Micro-learning (Actor + Critic) grounds these into specific item choices.
Uses retrieval-based memory instead of gradient updates to 'learn': the Critic estimates value functions via in-context learning, and the Planner retrieves past reflections to guide future thoughts.

Evaluation Highlights

Outperforms state-of-the-art RL methods (DORL, CIRS) on long-term engagement metrics (cumulative reward, interaction depth) across Steam and Amazon-Book datasets.
Achieves higher cumulative rewards than standard LLM baselines (ChatGPT, prompt-tuning) by effectively utilizing hierarchical planning and reflection.
The Critic module yields lower-variance value estimates than standard RL, stabilizing the learning process.

Breakthrough Assessment

7/10

Novel application of hierarchical LLM agents to the specific problem of long-term recommendation. Successfully replaces gradient-based RL with in-context memory updates for policy improvement.

⚙️ Technical Details

Problem Definition

Setting: Interactive Recommendation modeled as a Markov Decision Process (MDP) where an agent interacts with a simulated user environment

Inputs: User interaction history H, current state s_n

Outputs: Recommended item i (Action a_n)

Pipeline Flow

Macro-level: Reflector (Episode End) → Memory → Planner (Start of Step)
Micro-level: Planner (Thought) → Actor (Action) → Environment (Reward) → Critic (Advantage) → Memory
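The two flows above can be sketched as a single episode loop. This is only an illustrative outline: every name here (`Memory`, `run_episode`, the `toy_*` stubs) is hypothetical, the LLM calls are replaced by hard-coded functions, and the simulated user's quit rule is invented for the demo rather than taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Toy stand-in for the memory banks (plain lists, no embeddings)."""
    reflections: List[str] = field(default_factory=list)  # macro level
    experiences: List[Tuple[str, str, float]] = field(default_factory=list)  # micro level


def run_episode(env_step: Callable, planner: Callable, actor: Callable,
                critic: Callable, reflector: Callable,
                memory: Memory, init_state: str, max_steps: int = 5):
    """Run the micro-level loop per step, then one macro-level reflection."""
    state, trajectory = init_state, []
    for _ in range(max_steps):
        thought = planner(state, memory.reflections)           # Planner -> thought
        action = actor(state, thought, memory.experiences)     # Actor -> action
        next_state, reward, done = env_step(state, action)     # Environment -> reward
        advantage = critic(state, reward, next_state)          # Critic -> advantage
        memory.experiences.append((state, action, advantage))  # write micro memory
        trajectory.append((state, thought, action, reward))
        state = next_state
        if done:
            break
    memory.reflections.append(reflector(trajectory))           # Reflector -> macro memory
    return trajectory


# Hypothetical stubs standing in for the LLM modules and the simulated user.
def toy_env_step(state, action):
    done = state.endswith(action)  # invented rule: user quits on an immediate repeat
    return f"{state},{action}", (0.0 if done else 1.0), done


def toy_planner(state, reflections):
    return "diversify" if reflections else "repeat"


def toy_actor(state, thought, experiences):
    if thought == "repeat":
        return "A"
    return "B" if state.endswith("A") else "A"


def toy_critic(state, reward, next_state):
    return reward  # placeholder: treats V(s) as 0 everywhere


def toy_reflector(trajectory):
    return "Repetitive items cause withdrawal; diversify genres."


memory = Memory()
stubs = (toy_env_step, toy_planner, toy_actor, toy_critic, toy_reflector)
first = run_episode(*stubs, memory, "start")
second = run_episode(*stubs, memory, "start")
print(len(first), len(second))  # prints: 2 5
```

In this toy run the first episode ends after two steps because the greedy plan repeats an item, while the second episode, which reads the stored reflection, lasts the full five steps. No weights change between episodes; only the memory contents do.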

System Modules

Reflector (Macro-learning)

Analyze completed episode to generate high-level guiding principles (reflections)

Model or implementation: LLM instance (e.g., GPT-3.5-turbo or Llama2)

Planner (Macro-learning)

Generate high-level 'thought' (sub-plan) for the current step

Model or implementation: Frozen LLM instance + Memory Bank

Actor (Micro-learning)

Ground the thought into a specific item recommendation

Model or implementation: LLM instance + Memory Bank + Tool Library

Critic (Micro-learning)

Evaluate the action to provide feedback for future Actor updates

Model or implementation: LLM instance + Memory Bank

Novel Architectural Elements

Bi-level hierarchy: Splitting the recommendation agent into a high-level Planner (strategy) and low-level Actor (execution)
Memory-based Policy Update: Replacing gradient descent with updating an external memory of (state, action, value/reflection) tuples for in-context retrieval
LLM-based Critic: Using an LLM to estimate state-values V(s) via retrieval of similar past states rather than a trained neural value network
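A rough sketch of the memory-based idea follows, assuming state embeddings are already available as plain float vectors. `MemoryBank`, `build_critic_prompt`, and the two-dimensional embeddings are hypothetical illustrations, and the prompt wording is invented; the paper's actual templates and embedding model are not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

Record = Tuple[Tuple[float, ...], str, float]  # (state embedding, note, value)


class MemoryBank:
    """External memory that is appended to instead of training weights."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def write(self, embedding: Sequence[float], note: str, value: float) -> None:
        self._records.append((tuple(embedding), note, float(value)))

    def retrieve(self, query: Sequence[float], k: int = 2) -> List[Record]:
        # Rank stored states by Euclidean distance to the query embedding.
        q = tuple(query)
        return sorted(self._records, key=lambda rec: math.dist(rec[0], q))[:k]


def build_critic_prompt(state_text: str, neighbours: List[Record]) -> str:
    """Format retrieved records as in-context examples for a value estimate."""
    shots = "\n".join(f"- {note}: value {value:.2f}" for _, note, value in neighbours)
    return f"Similar past states:\n{shots}\nEstimate the value of: {state_text}"


bank = MemoryBank()
bank.write([1.0, 0.0], "action-heavy history", 0.5)
bank.write([0.0, 1.0], "diverse history", 2.0)
bank.write([0.8, 0.2], "mostly action history", 0.8)

neighbours = bank.retrieve([1.0, 0.1], k=2)
prompt = build_critic_prompt("user just played three action games", neighbours)
print(prompt)
```

The resulting prompt would then be sent to the LLM, which plays the role of the value estimator. "Learning" amounts to calling `write` after each step, so newer experience changes future estimates without any gradient update.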

Modeling

Base Model: Llama2-7b-chat or GPT-3.5-turbo (depending on experiment variant)

Training Method: In-context learning / Memory retrieval updates (No gradient updates to LLM weights)

Objective Functions:

Purpose: Calculate Advantage for Critic update.

Formally: v_n = r_n + gamma * V(s_{n+1}) - V(s_n)

Purpose: Retrieve relevant memories.

Formally: minimize Euclidean distance between current state embedding and stored state embeddings
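As a worked example of the advantage formula, the sketch below plugs in made-up numbers. The discount factor 0.9 is an illustrative assumption (the exact value is not reported), and treating the next-state value as 0 when the episode ends is the usual RL convention, assumed here rather than confirmed by the paper.

```python
def advantage(reward: float, v_state: float, v_next: float,
              gamma: float = 0.9, terminal: bool = False) -> float:
    """One-step advantage: r_n + gamma * V(s_{n+1}) - V(s_n).

    gamma=0.9 is an assumed value; terminal=True drops the bootstrap term
    (standard convention, not taken from the paper).
    """
    bootstrap = 0.0 if terminal else gamma * v_next
    return reward + bootstrap - v_state


# Mid-episode step: 1.0 + 0.9 * 3.0 - 2.0, roughly 1.7 (action beat expectations).
mid_episode = advantage(reward=1.0, v_state=2.0, v_next=3.0)

# Final step where the user quits: 0.0 + 0.0 - 2.0 (action fell short).
last_step = advantage(reward=0.0, v_state=2.0, v_next=3.0, terminal=True)

print(mid_episode, last_step)
```

A positive result marks an action that did better than the value estimate predicted, and a negative one marks an action to avoid in similar states.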

Key Hyperparameters:

inference_temperature: Non-zero (to ensure exploration)
retrieval_k: Not explicitly reported in the paper
discount_factor_gamma: Standard RL gamma (implied, exact value not in text)
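To illustrate why a non-zero temperature supports exploration, here is a generic temperature-scaled softmax. It is the textbook formulation, not code from the paper; the logits and the temperature values 1.0 and 0.1 are arbitrary examples, since the actual setting is not reported.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn scores into sampling probabilities; lower temperature is greedier."""
    if temperature <= 0:
        raise ValueError("temperature must be positive for sampling")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # arbitrary scores for three candidate outputs
warm = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.1)
print([round(p, 3) for p in warm])
print([round(p, 3) for p in cold])
```

At the lower temperature almost all probability mass sits on the top candidate, so repeated runs would produce nearly identical recommendations. A higher temperature leaves room for alternatives to be sampled and evaluated.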

Compute: Not reported in the paper

Comparison to Prior Work

vs. DORL/CIRS: BiLLP uses LLM reasoning and textual plans rather than black-box neural networks; BiLLP learns via memory retrieval rather than gradient descent.
vs. Chat-Rec: Chat-Rec focuses on immediate interaction; BiLLP explicitly plans for *long-term* rewards using a Critic and Planner.
vs. Reflexion [not cited in paper]: Similar reflection mechanism, but BiLLP applies it hierarchically (Macro vs Micro) specifically for recommendation MDPs.

Limitations

Relies on simulated environments for evaluation, which may not perfectly capture real user behavior.
Inference cost is likely high due to multiple LLM calls (Planner, Actor, Critic) per step.
Constrained to recommending one item per action in current implementation.

Reproducibility

Code: https://github.com/jizhi-zhang/BiLLP

The paper uses simulated environments (Steam, Amazon-Book) constructed from public datasets. Specific prompt templates are provided in the paper (Table 4.1.1).

📊 Experiments & Results

Evaluation Setup

Interactive recommendation in simulated environments built from Steam and Amazon-Book datasets.

Benchmarks:

Steam (Game Recommendation)
Amazon-Book (Book Recommendation)

Metrics:

Cumulative Reward (R_cum)
Interaction Depth (Depth) - average trajectory length

Statistical methodology: Not explicitly reported in the paper
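The two metrics listed above can be computed from logged per-step rewards roughly as follows. Averaging cumulative reward across episodes is an assumption of this sketch, and the function names and sample episodes are hypothetical.

```python
from statistics import mean
from typing import List


def cumulative_reward(episodes: List[List[float]]) -> float:
    """Mean over episodes of the summed per-step rewards (averaging is assumed)."""
    return mean([sum(rewards) for rewards in episodes])


def interaction_depth(episodes: List[List[float]]) -> float:
    """Average trajectory length, i.e. number of steps before the user leaves."""
    return mean([len(rewards) for rewards in episodes])


# Two made-up episodes: one ends after 3 steps, the other lasts 5 steps.
episodes = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
print(cumulative_reward(episodes))  # 3.5
print(interaction_depth(episodes))  # 4
```

Both numbers grow when users stay longer, which is why they serve as proxies for long-term engagement rather than per-click accuracy.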

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main performance comparison shows BiLLP outperforms both RL and LLM baselines on the Steam dataset.
Steam	Cumulative Reward	46.21	60.54	+14.33
Steam	Interaction Depth	13.45	17.02	+3.57
Performance on Amazon-Book dataset confirms generalization.
Amazon-Book	Cumulative Reward	23.58	27.45	+3.87
Ablation studies demonstrate the necessity of both Macro (Planner) and Micro (Critic) components.
Steam	Cumulative Reward	53.12	60.54	+7.42
Steam	Cumulative Reward	48.33	60.54	+12.21
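The Δ column can be checked, and converted to relative gains, with a few lines of arithmetic. The numbers below are copied from the table. The two ablation rows are labeled generically because the table does not say which component each one removes.

```python
# (label, metric, baseline, this paper, reported delta), copied from the table.
rows = [
    ("Steam", "Cumulative Reward", 46.21, 60.54, 14.33),
    ("Steam", "Interaction Depth", 13.45, 17.02, 3.57),
    ("Amazon-Book", "Cumulative Reward", 23.58, 27.45, 3.87),
    ("Steam ablation row 1", "Cumulative Reward", 53.12, 60.54, 7.42),
    ("Steam ablation row 2", "Cumulative Reward", 48.33, 60.54, 12.21),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement, rounded to the table's two decimals."""
    return round(ours - baseline, 2)


def relative_gain(baseline: float, ours: float) -> float:
    """Improvement as a fraction of the baseline score."""
    return (ours - baseline) / baseline


for label, metric, baseline, ours, reported in rows:
    print(f"{label:22s} {metric:18s} delta={delta(baseline, ours):+.2f} "
          f"(reported {reported:+.2f}) relative={relative_gain(baseline, ours):+.1%}")
```

Relative gains make the two datasets easier to compare, since their absolute reward scales differ.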

Main Takeaways

BiLLP consistently achieves higher cumulative rewards and interaction depths than state-of-the-art RL (DORL, CIRS) and LLM baselines.
Both macro-learning (strategic reflection) and micro-learning (action valuation) are critical; removing either leads to performance drops.
The LLM-based Critic provides stable value estimation, mitigating high variance issues seen in traditional RL when data is sparse.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDP, Actor-Critic, Advantage function)
Large Language Models (In-context learning, Prompting)
Recommender Systems (Interactive setting)

Key Terms

BiLLP: Bi-level Learnable LLM Planning—the proposed framework splitting planning into macro and micro levels

Macro-learning: The process where the Reflector analyzes full episodes to extract high-level principles (reflections) for the Planner

Micro-learning: The process where the Critic evaluates specific actions to update the Actor's item-selection policy

Reflector: An LLM module that analyzes finished interaction trajectories to diagnose failure (e.g., user quitting) and propose principles

Planner: An LLM module that generates high-level 'thoughts' or sub-plans based on the current state and retrieved reflections

Actor: An LLM module that converts high-level thoughts into specific item recommendations (actions), using tools and memory

Critic: An LLM module that estimates state values and, from them, the advantage of an action, which is used to guide the Actor

Advantage function: A measure of how much better a specific action is compared to the average action in a given state

Filter bubble: A situation where a user is only exposed to information/items they already like, potentially leading to boredom or narrow focus

Simulated Environment: An offline environment constructed from real user logs to mimic online user feedback for training RL models