Nathan Corecco, Giorgio Piatti, Luca A. Lanzendörfer, Flint Xiaofeng Fan, Roger Wattenhofer
arXiv
(2024)
RL, Recommendation, Memory, P13N
📝 Paper Summary
Reinforcement Learning for Recommender Systems (RL4Rec), User Simulation, Synthetic Environments
SUBER is a modular reinforcement learning environment that uses Large Language Models to simulate human users, their preferences, and rating behaviors, enabling the training of recommender systems without expensive online human interaction.
Core Problem
Training RL-based recommender systems requires massive online data, but experimenting on real users risks them abandoning the platform due to poor initial recommendations, while offline data is static and biased.
Why it matters:
Real-world data collection is expensive and risky because bad recommendations degrade user experience (exploration costs)
Offline evaluation metrics often fail to correlate with real-world performance
Existing simulators (like RecoGym or RecSim) rely on simpler mathematical models that lack the semantic understanding and behavioral complexity of human users
Concrete Example: In a movie recommendation setting, a standard RL agent needs feedback to learn. If trained on real users, it might recommend random, irrelevant movies while exploring, causing users to quit. SUBER lets the agent make these mistakes on a synthetic LLM-based user first, receiving realistic ratings (e.g., 1-10 stars) based on a generated persona.
Key Novelty
LLM-based User Simulation for RL Environments
Replaces mathematical user models with Large Language Models (LLMs) that act as synthetic users, predicting how a specific persona would rate a given item based on history
Introduces modular components to simulate specific human behaviors like 'Concept Drift' (evolving interests) and 'Fleeting Interests' (spontaneous decisions) via reward perturbation and shaping
Architecture
The overall architecture of the SUBER framework, illustrating the interaction loop between the RL Agent and the Synthetic Environment.
Evaluation Highlights
Successfully replicates rating distributions of real-world datasets (MovieLens and Amazon Books) using synthetic LLM users
Demonstrates that LLM agents can maintain consistent genre preferences (high ratings for liked genres, low for disliked) across interactions
Validates the ability to use historical interaction data to predict future ratings for item sequences (e.g., movie series like James Bond)
Breakthrough Assessment
7/10
A strong application of LLMs as simulators rather than recommenders. While the concept of LLM agents is known, wrapping it into a standardized Gym environment for RL4Rec addresses a major pain point in the field (lack of online simulators).
⚙️ Technical Details
Problem Definition
Setting: Reinforcement Learning Environment where the Agent is the Recommender System and the Environment is the Simulated User.
Inputs: Action (Item recommended by the RL agent)
Outputs: Reward (Simulated Rating generated by the LLM based on User Persona and History)
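The setting above can be sketched as a Gymnasium-style interaction loop. This is a minimal illustration under stated assumptions, not SUBER's actual API: `ToyUserEnv`, `stub_rater`, and `CATALOG` are hypothetical names, `step` returns a simplified `(observation, reward)` pair rather than the full Gymnasium tuple, and a deterministic function stands in for the LLM so the example runs without a model.

```python
import random

# Hypothetical catalog: item id -> genre. Not taken from the paper's datasets.
CATALOG = {0: "action", 1: "romance", 2: "action", 3: "horror"}


def stub_rater(persona, item_id, history):
    """Stand-in for the LLM: liked genres get a high rating, others a low one."""
    return 9 if CATALOG[item_id] in persona["likes"] else 2


class ToyUserEnv:
    """Environment = simulated user; action = recommended item; reward = rating."""

    def __init__(self, persona, rater=stub_rater):
        self.persona = persona
        self.rater = rater
        self.history = []

    def reset(self):
        # Start a new episode with an empty interaction history.
        self.history = []
        return {"persona": self.persona, "history": list(self.history)}

    def step(self, item_id):
        # The simulated user rates the recommended item; the rating is the reward.
        rating = self.rater(self.persona, item_id, self.history)
        self.history.append((item_id, rating))
        obs = {"persona": self.persona, "history": list(self.history)}
        return obs, float(rating)


env = ToyUserEnv({"likes": {"action"}})
obs = env.reset()
rng = random.Random(0)
total = 0.0
for _ in range(5):
    action = rng.choice(list(CATALOG))  # a random policy stands in for the RL agent
    obs, reward = env.step(action)
    total += reward
```

In a real setup the random policy would be replaced by a learning agent, and `stub_rater` by a prompted LLM call.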
Pipeline Flow
Group: Memory Retrieval -> Preprocessing
Group: LLM Prediction -> Generation
Group: Reward Adjustment -> Postprocessing
System Modules
Item Retrieval
Selects relevant past interactions from the user's history to fit within the LLM's context window
Model or implementation: Sentence-T5 (for similarity retrieval) or Sorensen Coefficient (for feature retrieval)
Prompt Constructor
Aggregates user profile, retrieved history, and query item into a natural language prompt
Model or implementation: Rule-based templates
User Simulator
Predicts the rating the synthetic user would give to the recommended item
Model or implementation: Llama-2, Vicuna, or Mistral (Quantized via GPTQ)
Reward Perturbation/Shaping
Modifies the raw LLM rating to simulate noise, concept drift, or diminishing interest in repeated items
Model or implementation: Mathematical functions (Gaussian/Greedy noise, Time-decay formulas)
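The first two modules can be illustrated with a short sketch of feature-based retrieval followed by a rule-based prompt. The Sørensen-Dice coefficient is standard, but everything else here is an assumption made for illustration: the function names, the record layout, the placeholder titles, and the prompt wording are not taken from the paper. The Sentence-T5 similarity variant is not shown.

```python
def sorensen_dice(a, b):
    """Sørensen-Dice coefficient between two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def retrieve(history, query_features, k=2):
    """Return the k past interactions whose feature sets best match the query."""
    ranked = sorted(
        history,
        key=lambda h: sorensen_dice(h["features"], query_features),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(profile, retrieved, query_title, scale=(1, 10)):
    """Rule-based template; the wording is illustrative, not the paper's prompt."""
    hobbies = ", ".join(profile["hobbies"])
    lines = [f"You are {profile['name']}, age {profile['age']}. Hobbies: {hobbies}."]
    lines.append("Movies you rated before:")
    for h in retrieved:
        lines.append(f"- {h['title']}: {h['rating']}/{scale[1]}")
    lines.append(
        f"Rate '{query_title}' from {scale[0]} to {scale[1]}. Answer with a number only."
    )
    return "\n".join(lines)


# Made-up history and query item, for illustration only.
history = [
    {"title": "Movie A", "features": {"action", "spy"}, "rating": 8},
    {"title": "Movie B", "features": {"romance", "drama"}, "rating": 3},
    {"title": "Movie C", "features": {"action", "thriller"}, "rating": 7},
]
query = {"title": "Movie D", "features": {"action", "spy", "thriller"}}

top = retrieve(history, query["features"], k=2)
prompt = build_prompt({"name": "Alex", "age": 30, "hobbies": ["hiking"]}, top, query["title"])
print(prompt)
```

Limiting retrieval to the top `k` interactions is what keeps the prompt within the context window as the history grows.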
Novel Architectural Elements
Separation of Reward Perturbation (stored in memory to simulate permanent drift) and Reward Shaping (returned to agent but not stored, simulating transient mood)
Integration of an LLM as a dynamic Gym environment rather than a static dataset or mathematical formula
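The split between perturbation and shaping can be sketched as follows. This is a hedged illustration of the idea as summarized above, not the paper's formulas: `RewardPipeline`, the Gaussian noise level, and the exponential repeat discount are all assumptions chosen to keep the example small.

```python
import random


def clip(x, lo=1.0, hi=10.0):
    """Keep a rating inside the assumed 1-10 scale."""
    return max(lo, min(hi, x))


class RewardPipeline:
    """Hypothetical post-processing of a raw simulated rating."""

    def __init__(self, noise_std=0.5, decay=0.8, seed=0):
        self.rng = random.Random(seed)
        self.noise_std = noise_std
        self.decay = decay
        self.memory = []      # (item_id, stored_rating): what later prompts would see
        self.seen_count = {}  # item_id -> number of earlier recommendations

    def process(self, item_id, raw_rating):
        # Perturbation: add Gaussian noise and WRITE the result to memory,
        # so it influences future simulated ratings.
        perturbed = clip(raw_rating + self.rng.gauss(0.0, self.noise_std))
        self.memory.append((item_id, perturbed))

        # Shaping: discount repeated items. Only the value returned to the
        # agent changes; memory keeps the unshaped rating.
        n = self.seen_count.get(item_id, 0)
        self.seen_count[item_id] = n + 1
        return perturbed * (self.decay ** n)


pipe = RewardPipeline(noise_std=0.0)  # noise disabled to make the effect visible
first = pipe.process(item_id=7, raw_rating=8.0)
second = pipe.process(item_id=7, raw_rating=8.0)
```

With noise disabled, the second recommendation of the same item returns a lower reward to the agent while both memory entries keep the original rating.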
Modeling
Base Model: Variants of Llama-2, Vicuna (v1.3/v1.5), and Mistral-7B
The framework is open source, with code publicly available at https://github.com/SUBER-Team/SUBER. User profiles are generated from synthetic templates (age, hobbies). The datasets used are MovieLens (ml-latest-small) and the Amazon Book Dataset.
📊 Experiments & Results
Evaluation Setup
Synthetic environment validation (not competing on leaderboard metrics, but validating simulator fidelity)
Benchmarks:
Movie Recommendation (Rating Prediction (1-10)) [New]
Book Recommendation (Rating Prediction (1-5)) [New]
Metrics:
Total Variation Distance (TVD) between synthetic and real rating distributions
Rating Accuracy (on genre preference tests)
Statistical methodology: Comparison of empirical distributions using TVD.
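For discrete rating distributions, the total variation distance is half the sum of absolute differences between the two probability mass functions. The sketch below computes it from two lists of ratings. The helper names and the sample ratings are made up for illustration and are not results from the paper.

```python
from collections import Counter


def empirical_distribution(ratings, support):
    """Relative frequency of each rating value in `support`."""
    counts = Counter(ratings)
    n = len(ratings)
    return {r: counts.get(r, 0) / n for r in support}


def total_variation_distance(p, q):
    """TVD(p, q) = 0.5 * sum_r |p(r) - q(r)|; 0 means identical, 1 means disjoint."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(r, 0.0) - q.get(r, 0.0)) for r in support)


support = range(1, 6)                 # a 1-5 rating scale
real = [5, 4, 4, 3, 5, 2, 4, 5]       # made-up ratings, for illustration only
synthetic = [5, 4, 3, 3, 5, 1, 4, 4]  # made-up ratings, for illustration only

tvd = total_variation_distance(
    empirical_distribution(real, support),
    empirical_distribution(synthetic, support),
)
print(tvd)  # 0.25
```

A lower value means the synthetic users' rating histogram is closer to the real one.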
Main Takeaways
LLMs can effectively act as synthetic users that respect defined personas (e.g., consistently rating liked genres higher than disliked ones).
The framework successfully simulates 'concept drift' and 'fleeting interests' through modular reward perturbation, adding realism beyond static datasets.
Synthetic rating distributions closely match real-world distributions (MovieLens), validating the simulator's fidelity.
There is a trade-off between model size and simulation speed; smaller quantized models (7B) are faster for RL loops but may have lower fidelity than larger models.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning (RL)
Recommender Systems
Large Language Models (LLMs)
In-context Learning
Key Terms
RL4Rec: Reinforcement Learning for Recommender Systems, in which RL agents dynamically select items for users to maximize long-term engagement
Concept Drift: The phenomenon where the statistical properties of the target variable (user preferences) change over time
Gymnasium: A standard API for reinforcement learning environments (formerly OpenAI Gym)
GPTQ: A quantization technique to compress LLMs (e.g., to 4 bits) for faster inference with minimal performance loss
Reward Shaping: Modifying the reward signal returned to the agent to guide learning; used here to simulate fleeting interests without altering long-term memory
Sorensen Coefficient: A statistic used for comparing the similarity of two samples, used here for feature-based retrieval
Sentence-T5: A transformer model trained to generate sentence embeddings, used here for similarity-based item retrieval