
Large Language Models are Learnable Planners for Long-Term Recommendation

Wentao Shi, Xiangnan He, Yang Zhang, Chongming Gao, Xinyue Li, Jizhi Zhang, Qifan Wang, Fuli Feng
University of Science and Technology of China, Meta AI
arXiv (2024)
Recommendation Agent Memory RL

📝 Paper Summary

LLM-based Recommendation, Interactive Recommendation
BiLLP improves long-term user engagement in recommendation by using LLMs for hierarchical planning, splitting learning into macro-level reflection on principles and micro-level personalized action adjustments.
Core Problem
Traditional RL-based recommenders struggle with data sparsity and instability when learning long-term planning from scratch, while standard LLM recommenders lack mechanisms to learn from interaction feedback for long-term engagement.
Why it matters:
  • Greedy recommendation strategies (optimizing clicks) trap users in filter bubbles and echo chambers, hurting long-term retention
  • Existing RL methods require massive data to learn stable policies, failing on sparse or long-tail items
  • LLMs applied directly, without adaptation, lack both personalization and the task-specific 'commonsense' about which principles sustain long-term engagement
Concrete Example: A greedy model might repeatedly recommend similar action games to a user, maximizing immediate clicks but causing boredom and eventual churn (filter bubble). BiLLP's planner detects this pattern, reflects that 'repetitive items cause withdrawal', and plans to diversify genres, extending the interaction.
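The effect described in this example can be illustrated with a toy simulation. The sketch below is not the paper's environment: the exit rule (the user leaves once three consecutive recommendations share a genre), the genre list, and the names `simulate`, `greedy`, and `diversified` are all assumptions made for illustration.

```python
GENRES = ["action", "rpg", "puzzle"]  # hypothetical genre catalogue


def simulate(policy, max_steps=20, window=3):
    """Toy user model (assumption): the user quits as soon as the last
    `window` recommendations all share one genre. Returns the number of
    interactions before the session ends."""
    history = []
    for _ in range(max_steps):
        history.append(policy(history))
        recent = history[-window:]
        if len(recent) == window and len(set(recent)) == 1:
            break  # boredom: the session ends here
    return len(history)


def greedy(history):
    """Always recommends the genre with the highest immediate click rate."""
    return "action"


def diversified(history):
    """Rotates through genres, trading some immediate reward for variety."""
    return GENRES[len(history) % len(GENRES)]


greedy_len = simulate(greedy)
diverse_len = simulate(diversified)
print(greedy_len, diverse_len)
```

Under these assumptions the greedy policy ends the session after three interactions, while the rotating policy reaches the step limit. This is the gap in interaction depth that a long-term planner aims to close.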
Key Novelty
Bi-level Learnable LLM Planning (BiLLP)
  • Decomposes recommendation into two loops: Macro-learning (Planner + Reflector) learns high-level principles like 'diversity keeps users engaged', while Micro-learning (Actor + Critic) grounds these into specific item choices.
  • Uses retrieval-based memory instead of gradient updates to 'learn': the Critic estimates value functions via in-context learning, and the Planner retrieves past reflections to guide future thoughts.
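The retrieval-based "learning" described above can be sketched as two small memories, one per level. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words similarity stands in for a learned text encoder, the nearest-neighbour average stands in for the Critic's in-context LLM estimate, and the names `ReflectionMemory`, `ValueMemory`, `embed`, and `cosine` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a real system would use a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ReflectionMemory:
    """Macro level (hypothetical): stores reflections keyed by the situation that produced them."""

    def __init__(self):
        self.entries = []  # list of (situation_vector, reflection_text)

    def add(self, situation, reflection):
        self.entries.append((embed(situation), reflection))

    def retrieve(self, situation, k=1):
        """Return the k reflections whose situations are most similar to the query."""
        query = embed(situation)
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]


class ValueMemory:
    """Micro level (hypothetical): stores (state, observed return) pairs for the Critic."""

    def __init__(self):
        self.entries = []  # list of (state_vector, observed_return)

    def add(self, state, observed_return):
        self.entries.append((embed(state), observed_return))

    def estimate(self, state, k=2):
        """Average return of the k most similar stored states; no gradient update is involved."""
        query = embed(state)
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)[:k]
        if not ranked:
            return 0.0
        return sum(r for _, r in ranked) / len(ranked)


reflections = ReflectionMemory()
reflections.add("user received five action games in a row and quit",
                "repetitive items cause withdrawal; diversify genres")
reflections.add("user new with no history", "start with popular items")
top = reflections.retrieve("user keeps getting action games", k=1)

values = ValueMemory()
values.add("recent: action action action", 2.0)
values.add("recent: action rpg puzzle", 9.0)
values.add("recent: rpg puzzle strategy", 8.0)
estimate = values.estimate("recent: action action action", k=1)
print(top, estimate)
```

In this sketch, "learning" amounts to appending to the memories after each episode; the policy changes because later prompts retrieve different context, not because any model weights change.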
Evaluation Highlights
  • Outperforms state-of-the-art RL methods (DORL, CIRS) on long-term engagement metrics (cumulative reward, interaction depth) across Steam and Amazon-Book datasets.
  • Achieves higher cumulative rewards than standard LLM baselines (ChatGPT, prompt-tuning) by effectively utilizing hierarchical planning and reflection.
  • The Critic module yields lower-variance value estimates than standard RL, which stabilizes the learning process.
Breakthrough Assessment
7/10
Novel application of hierarchical LLM agents to the specific problem of long-term recommendation. Successfully replaces gradient-based RL with in-context memory updates for policy improvement.