BiLLP: Bi-level Learnable LLM Planning, the proposed framework that splits planning into macro and micro levels
Macro-learning: The process where the Reflector analyzes full episodes to extract high-level principles (reflections) for the Planner
Micro-learning: The process where the Critic evaluates specific actions to update the Actor's item-selection policy
Reflector: An LLM module that analyzes finished interaction trajectories to diagnose failures (e.g., the user quitting) and propose principles
Planner: An LLM module that generates high-level 'thoughts' or sub-plans based on the current state and retrieved reflections
Actor: An LLM module that converts high-level thoughts into specific item recommendations (actions), using tools and memory
Critic: An LLM module that estimates the long-term value (advantage) of an action to guide the Actor
Advantage function: A measure of how much better a specific action is compared to the average action in a given state
Filter bubble: A situation where a user is exposed only to information/items they already like, potentially leading to boredom or a narrow focus
Simulated Environment: An offline environment constructed from real user logs to mimic online user feedback for training RL models
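
To make the "Advantage function" entry concrete, here is a minimal sketch using the standard reinforcement-learning definition A(s, a) = Q(s, a) - V(s), where V(s) is the policy-weighted average of the action values. This is an illustration only: the glossary does not state how the Critic actually estimates the advantage, and the function name `advantages`, the item names, and all numbers below are hypothetical.

```python
from typing import Dict


def advantages(q_values: Dict[str, float], policy: Dict[str, float]) -> Dict[str, float]:
    """Return A(s, a) = Q(s, a) - V(s) for every action in a single state."""
    # V(s): the value of the "average action", weighted by how likely
    # the policy is to pick each action in this state.
    state_value = sum(policy[a] * q_values[a] for a in q_values)
    return {a: q - state_value for a, q in q_values.items()}


# Hypothetical long-term value estimates for three candidate items.
q = {"item_a": 4.0, "item_b": 2.0, "item_c": 0.0}
# Hypothetical Actor policy over the same items (probabilities sum to 1).
pi = {"item_a": 0.5, "item_b": 0.25, "item_c": 0.25}

adv = advantages(q, pi)
for item, value in adv.items():
    print(f"{item}: advantage = {value:+.2f}")
```

A positive advantage marks an action that is better than the policy's average behavior in that state and a negative one marks an action that is worse, which is the kind of signal the Critic entry describes for guiding the Actor.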