SUBER: An RL Environment with Simulated Human Behavior for Recommender Systems

📝 Paper Summary

Reinforcement Learning for Recommender Systems (RL4Rec), User Simulation, Synthetic Environments

SUBER is a modular reinforcement learning environment that uses Large Language Models to simulate human users, their preferences, and rating behaviors, enabling the training of recommender systems without expensive online human interaction.

Core Problem

Training RL-based recommender systems requires massive online data, but experimenting on real users risks them abandoning the platform due to poor initial recommendations, while offline data is static and biased.

Why it matters:

Real-world data collection is expensive and risky because bad recommendations degrade user experience (exploration costs)
Offline evaluation metrics often fail to correlate with real-world performance
Existing simulators (like RecoGym or RecSim) rely on simpler mathematical models that lack the semantic understanding and behavioral complexity of human users

Concrete Example: In a movie recommendation setting, a standard RL agent needs feedback to learn. If trained on real users, it might recommend random irrelevant movies to explore, causing users to quit. SUBER allows the agent to make these mistakes on a synthetic LLM-based user first, receiving realistic ratings (e.g., 1-10 stars) based on a generated persona.

Key Novelty

LLM-based User Simulation for RL Environments

Replaces mathematical user models with Large Language Models (LLMs) that act as synthetic users, predicting how a specific persona would rate a given item based on history
Introduces modular components to simulate specific human behaviors like 'Concept Drift' (evolving interests) and 'Fleeting Interests' (spontaneous decisions) via reward perturbation and shaping

Architecture

The overall architecture of the SUBER framework, illustrating the interaction loop between the RL Agent and the Synthetic Environment.

Evaluation Highlights

Successfully replicates rating distributions of real-world datasets (MovieLens and Amazon Books) using synthetic LLM users
Demonstrates that LLM agents can maintain consistent genre preferences (high ratings for liked genres, low for disliked) across interactions
Validates the ability to use historical interaction data to predict future ratings for item sequences (e.g., movie series like James Bond)

Breakthrough Assessment

7/10

A strong application of LLMs as simulators rather than recommenders. While the concept of LLM agents is known, wrapping LLM-simulated users in a standardized Gym environment for RL4Rec addresses a major pain point in the field (lack of online simulators).

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning Environment where the Agent is the Recommender System and the Environment is the Simulated User.

Inputs: Action (Item recommended by the RL agent)

Outputs: Reward (Simulated Rating generated by the LLM based on User Persona and History)

Pipeline Flow

Preprocessing: Memory Retrieval
Generation: LLM Prediction
Postprocessing: Reward Adjustment
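
The three stages can be read as a single environment step: an item comes in as the action and a rating goes out as the reward. The following is a minimal sketch under stated assumptions, not SUBER's actual code. The names `retrieve`, `stub_llm`, `adjust`, and `step` are hypothetical, retrieval is reduced to "most recent k items", and the LLM call is replaced by a deterministic stub so the example runs without a model.

```python
from typing import Callable

History = list[tuple[str, int]]  # (item title, rating) pairs


def retrieve(history: History, k: int = 3) -> History:
    """Preprocessing: keep the k most recent interactions.
    (Hypothetical stand-in for similarity- or feature-based retrieval.)"""
    return history[-k:]


def stub_llm(prompt: str) -> int:
    """Generation: placeholder for the LLM call, returns a fixed rating."""
    return 5


def adjust(rating: int, low: int = 1, high: int = 10) -> int:
    """Postprocessing: clamp to the rating scale.
    (Perturbation or shaping would be applied here.)"""
    return max(low, min(high, rating))


def step(history: History, item: str,
         llm: Callable[[str], int] = stub_llm) -> int:
    """One environment step: action (item) in, reward (rating) out."""
    context = retrieve(history)
    prompt = f"Past ratings: {context}. Rate the item: {item}"
    reward = adjust(llm(prompt))
    history.append((item, reward))  # the interaction becomes part of memory
    return reward
```

Swapping `stub_llm` for a real model call would leave the surrounding loop unchanged, which is the point of keeping the stages separate.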

System Modules

Item Retrieval

Selects relevant past interactions from the user's history to fit within the LLM's context window

Model or implementation: Sentence-T5 (for similarity retrieval) or Sorensen Coefficient (for feature retrieval)
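
For the feature-based variant, the Sørensen coefficient of two feature sets A and B is 2|A ∩ B| / (|A| + |B|). The sketch below ranks past interactions by that score against the query item. It is illustrative only: using genre sets as the features, and the names `dice` and `retrieve_by_features`, are assumptions rather than details taken from the paper.

```python
def dice(a: set[str], b: set[str]) -> float:
    """Sørensen (Dice) coefficient of two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def retrieve_by_features(query: set[str],
                         history: list[tuple[str, set[str]]],
                         k: int = 2) -> list[str]:
    """Return the titles of the k past items most similar to the query."""
    ranked = sorted(history, key=lambda entry: dice(query, entry[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]


# Hypothetical history: (title, genre set)
past = [
    ("Heat", {"crime", "thriller"}),
    ("Up", {"animation", "family"}),
    ("Se7en", {"crime", "mystery", "thriller"}),
]
top = retrieve_by_features({"crime", "thriller"}, past, k=2)
```

Only the top-k items are passed on to the prompt, which is how retrieval keeps the history within the context window.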

Prompt Constructor

Aggregates user profile, retrieved history, and query item into a natural language prompt

Model or implementation: Rule-based templates
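
A rule-based template of this kind might look like the sketch below. The wording of the prompt, the field names, and the function `build_prompt` are all hypothetical; the paper's actual templates may differ.

```python
def build_prompt(profile: dict[str, str],
                 history: list[tuple[str, int]],
                 item: str,
                 scale: tuple[int, int] = (1, 10)) -> str:
    """Combine user profile, retrieved history, and the query item
    into a single natural-language prompt (illustrative wording)."""
    profile_text = ", ".join(f"{key}: {value}" for key, value in profile.items())
    history_text = "\n".join(
        f"- {title}: {rating}" for title, rating in history
    ) or "- (none)"
    return (
        f"User profile: {profile_text}\n"
        f"Previously rated items:\n{history_text}\n"
        f"On a scale of {scale[0]} to {scale[1]}, "
        f"how would this user rate '{item}'?\n"
        "Rating:"
    )


prompt = build_prompt(
    {"age": "30", "hobbies": "hiking"},
    [("Heat", 8)],
    "Ronin",
)
```

Ending the prompt with "Rating:" is a common way to encourage the model to answer with a number as its next token.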

User Simulator

Predicts the rating the synthetic user would give to the recommended item

Model or implementation: Llama-2, Vicuna, or Mistral (Quantized via GPTQ)

Reward Perturbation/Shaping

Modifies the raw LLM rating to simulate noise, concept drift, or diminishing interest in repeated items

Model or implementation: Mathematical functions (Gaussian/Greedy noise, Time-decay formulas)
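
The sketch below illustrates what such functions could look like. The exact formulas are assumptions made for illustration: greedy noise is assumed here to mean shifting the rating by one step with probability q, and the decay is a simple geometric shrink toward the bottom of the scale. The function names and the `rate` parameter are hypothetical.

```python
import random


def clamp(value: int, low: int, high: int) -> int:
    """Keep a rating inside the valid scale."""
    return max(low, min(high, value))


def gaussian_noise(rating: int, sigma: float, rng: random.Random,
                   low: int = 1, high: int = 10) -> int:
    """Add zero-mean Gaussian noise, then round and clamp."""
    return clamp(round(rating + rng.gauss(0.0, sigma)), low, high)


def greedy_noise(rating: int, q: float, rng: random.Random,
                 low: int = 1, high: int = 10) -> int:
    """With probability q move the rating one step up or down
    (assumed definition); otherwise leave it unchanged."""
    if rng.random() < q:
        rating += rng.choice((-1, 1))
    return clamp(rating, low, high)


def time_decay(rating: int, times_seen: int,
               rate: float = 0.5, low: int = 1) -> int:
    """Shrink the rating toward the bottom of the scale each time the
    same item has already been shown (illustrative formula)."""
    return round(low + (rating - low) * (1.0 - rate) ** times_seen)
```

Passing an explicit `random.Random` instance keeps the noise reproducible across training runs.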

Novel Architectural Elements

Separation of Reward Perturbation (stored in memory to simulate permanent drift) and Reward Shaping (returned to agent but not stored, simulating transient mood)
Integration of an LLM as a dynamic Gym environment rather than a static dataset or mathematical formula
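
The distinction between the two adjustments comes down to what gets written to memory. A minimal sketch, assuming shaping is applied on top of the perturbed rating (the ordering is an assumption), with the hypothetical class `RatingMemory`:

```python
from typing import Callable

Adjust = Callable[[int], int]


class RatingMemory:
    """Stores perturbed ratings; shaped rewards are returned but never stored."""

    def __init__(self) -> None:
        self.stored: dict[str, list[int]] = {}

    def record(self, item: str, raw_rating: int,
               perturb: Adjust, shape: Adjust) -> int:
        # Perturbation is persistent: it changes what the user "remembers",
        # so it influences future prompts built from memory.
        perturbed = perturb(raw_rating)
        self.stored.setdefault(item, []).append(perturbed)
        # Shaping is transient: it only changes the reward the agent sees.
        return shape(perturbed)


memory = RatingMemory()
reward = memory.record("Heat", 7,
                       perturb=lambda r: r + 1,   # e.g. drift
                       shape=lambda r: r - 3)     # e.g. momentary mood
```

After this call the agent receives 5 as its reward, while memory holds 8 for "Heat", so later predictions are conditioned on the drifted rating rather than the mood-adjusted one.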

Modeling

Base Model: Variants of Llama-2, Vicuna (v1.3/v1.5), and Mistral-7B

Key Hyperparameters:

quantization: 4-bit (GPTQ)
movie_rating_scale: 1-10
book_rating_scale: 1-5
greedy_noise_probability_q: Not explicitly reported in the paper (variable parameter)
context_limit: Dependent on specific LLM (e.g., 4096 for Llama-2)

Compute: All models run within a 24GB memory limit (consumer GPU).

Comparison to Prior Work

vs. RecoGym/RecSim: SUBER uses natural language (LLMs) to simulate behavior, allowing for semantic reasoning rather than just statistical correlations
vs. VirtualTaobao: SUBER is not dataset-dependent and can simulate users in any domain where an LLM has knowledge (movies, books)
vs. LLM-as-Recommender (Wang et al.): SUBER uses the LLM as the *Environment* (User), not the *Agent* (Recommender)

Limitations

LLM inference is significantly slower than evaluating a mathematical simulator, which slows down RL training loops
Performance depends heavily on the underlying LLM's knowledge of the specific items (movies/books)
Tokenization ambiguity for numbers (e.g., '10' vs '1' '0') requires specific handling/shifting of rating scales
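
One way to sidestep the tokenization issue is to present the model with a single-digit scale and map the answer back afterwards. The sketch below assumes a shift of the 1-10 movie scale to 0-9 so that every valid answer is one character; the helper names are hypothetical and this is not necessarily how the SUBER code handles it.

```python
def to_prompt_scale(rating: int) -> int:
    """Map a 1-10 rating to the 0-9 scale shown to the LLM."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating out of range: {rating}")
    return rating - 1


def from_prompt_scale(answer: str) -> int:
    """Map the LLM's single-digit answer (0-9) back to a 1-10 rating."""
    text = answer.strip()
    if len(text) != 1 or text not in "0123456789":
        raise ValueError(f"expected a single digit, got {answer!r}")
    return int(text) + 1
```

Because every valid answer is exactly one digit, a multi-character reply such as "10" is rejected instead of being silently misread.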

Reproducibility

Code: https://github.com/SUBER-Team/SUBER

The framework is open-source. User profiles are generated using synthetic templates (age, hobbies). The datasets used are MovieLens (ml-latest-small) and the Amazon Book Dataset.

📊 Experiments & Results

Evaluation Setup

Synthetic environment validation (not competing on leaderboard metrics, but validating simulator fidelity)

Benchmarks:

Movie Recommendation: rating prediction on a 1-10 scale [New]
Book Recommendation: rating prediction on a 1-5 scale [New]

Metrics:

Total Variation Distance (TVD) between synthetic and real rating distributions
Rating Accuracy (on genre preference tests)
Statistical methodology: Comparison of empirical distributions using TVD.
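
For two discrete rating distributions P and Q over the same scale, the total variation distance is TVD(P, Q) = 0.5 * sum over ratings r of |P(r) - Q(r)|. It ranges from 0 (identical distributions) to 1 (disjoint support). A minimal sketch of the computation, with hypothetical function names and toy data rather than the paper's results:

```python
from collections import Counter
from typing import Iterable, Sequence


def rating_distribution(ratings: Sequence[int],
                        scale: Iterable[int]) -> list[float]:
    """Empirical probability of each rating value on the given scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    counts = Counter(ratings)
    return [counts[value] / len(ratings) for value in scale]


def total_variation_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """TVD between two distributions defined over the same scale."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same scale")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy example on a 1-3 scale (not data from the paper)
real = rating_distribution([1, 2, 2, 3], range(1, 4))
synthetic = rating_distribution([2, 2, 3, 3], range(1, 4))
tvd = total_variation_distance(real, synthetic)
```

A lower value means the synthetic users' ratings are distributed more like the real dataset's ratings.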

Main Takeaways

LLMs can effectively act as synthetic users that respect defined personas (e.g., consistently rating liked genres higher than disliked ones).
The framework successfully simulates 'concept drift' and 'fleeting interests' through modular reward perturbation, adding realism beyond static datasets.
Synthetic rating distributions closely match real-world distributions (MovieLens), validating the simulator's fidelity.
There is a trade-off between model size and simulation speed; smaller quantized models (7B) are faster for RL loops but may have lower fidelity than larger models.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL)
Recommender Systems
Large Language Models (LLMs)
In-context Learning

Key Terms

RL4Rec: Reinforcement Learning for Recommender Systems—using RL agents to dynamically select items for users to maximize long-term engagement

Concept Drift: The phenomenon where the statistical properties of the target variable (user preferences) change over time

Gymnasium: A standard API for reinforcement learning environments (formerly OpenAI Gym)

GPTQ: A quantization technique to compress LLMs (e.g., to 4 bits) for faster inference with minimal performance loss

Reward Shaping: Modifying the reward signal returned to the agent to guide learning; used here to simulate fleeting interests without altering long-term memory

Sorensen Coefficient: A statistic used for comparing the similarity of two samples, used here for feature-based retrieval

Sentence-T5: A transformer model trained to generate sentence embeddings, used here for similarity-based item retrieval