Lusifer: LLM-based User SImulated Feedback Environment for online Recommender systems

📝 Paper Summary

User Simulation, Generative Agents

Lusifer is a simulation environment that uses Large Language Models to generate dynamic, explainable user feedback and evolving profiles for training reinforcement learning recommender systems.

Core Problem

Traditional RL recommender training relies on static datasets that fail to capture evolving user preferences, while existing simulators lack realism or require complex, domain-specific hand-crafting.

Why it matters:

Live user experiments are costly, risky, and ethically constrained, limiting the testing of new RL policies
Static datasets (offline RL) suffer from distribution shifts and cannot simulate how a user reacts to a sequence of new recommendations
Existing simulators like RecSim NG or RecoGym either lack semantic realism or are too computationally complex to scale easily

Concrete Example: A user might initially like 'Action' movies but shift towards 'Sci-Fi' after watching a specific series. A static dataset records only the final ratings. Lusifer simulates the *transition* after each batch of movies and explains in text why the preference shifted.

Key Novelty

Incremental LLM-based User Profiling

Processes user history in small sequential batches (e.g., 10 movies), prompting an LLM to update a textual summary of preferences after each batch
Generates ratings for new items based strictly on this evolving textual profile and item metadata (overviews/tags), rather than collaborative filtering vectors
Provides natural language explanations for *why* a user's profile changed or why a specific rating was given
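
The incremental profiling loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `chunk`, `build_profile`, and `stub_llm`, the interaction fields, and the prompt wording are all hypothetical, and a stub stands in for a real model call.

```python
from typing import Callable, Dict, List

# Hypothetical record shape: {"title": str, "rating": int, "overview": str}
Interaction = Dict[str, object]


def chunk(items: List[Interaction], size: int) -> List[List[Interaction]]:
    """Split a chronologically ordered history into consecutive batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_profile(history: List[Interaction],
                  llm: Callable[[str], str],
                  batch_size: int = 10) -> List[str]:
    """Return the profile text after each batch; the last entry is the current profile."""
    profile = ""  # no profile exists before the first batch
    snapshots: List[str] = []
    for batch in chunk(history, batch_size):
        lines = [f"- {it['title']} (rated {it['rating']}/5): {it['overview']}" for it in batch]
        prompt = (
            "Current user profile:\n" + (profile or "(none yet)") + "\n\n"
            "New interactions:\n" + "\n".join(lines) + "\n\n"
            "Update the profile and explain what changed."
        )
        # The new state depends on the previous summary plus the newest batch.
        profile = llm(prompt)
        snapshots.append(profile)
    return snapshots


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model: reports how many interactions the prompt listed."""
    n = sum(1 for line in prompt.splitlines() if line.startswith("- "))
    return f"profile updated with {n} items"


history = [{"title": f"Movie {i}", "rating": 3, "overview": "n/a"} for i in range(25)]
snapshots = build_profile(history, stub_llm)
print(snapshots)
```

Passing the model as a plain callable keeps the loop independent of any particular backend, so a hosted API or a local model could be swapped in without changing the batching logic.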

Architecture

The two-phase pipeline: Profile Creation and Rating Generation

Evaluation Highlights

Outperforms neural baselines in cold-start scenarios: Gemma:12B achieves 1.18 RMSE vs NCF's 1.29 on MovieLens 100K for users with <10 interactions
Simulates feedback from only the last 40 interactions (approx. 30% of user history), reducing data reliance
Provides interpretable justification for every profile update, unlike latent-factor baselines (SVD++, ALS) which are black boxes

Breakthrough Assessment

6/10

Novel application of LLMs for incremental user modeling in simulations. While it doesn't beat baselines on general accuracy, its explainability and cold-start performance make it a valuable tool for RL evaluation.

⚙️ Technical Details

Problem Definition

Setting: Simulating user ratings and profile evolution in an online recommendation environment using limited interaction history

Inputs: Sequence of recent user interactions (movies + ratings + metadata) and a target candidate item

Outputs: Simulated integer rating (1-5) and an updated textual user profile summary

Pipeline Flow

Data Preprocessing: Enrich MovieLens with TMDB textual metadata
Phase 1: Incremental Profile Creation (Batch Processing)
Phase 2: Simulated Rating Generation
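
As a rough sketch of the preprocessing step, the snippet below joins rating rows with textual metadata and groups them per user in chronological order. The field names (`user_id`, `movie_id`, `timestamp`, `title`, `overview`) and the function `enrich_and_group` are assumptions for illustration; it also assumes the MovieLens-to-TMDB id mapping has already been resolved, which the summary does not describe.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def enrich_and_group(ratings: List[Row], metadata: Dict[int, Row]) -> Dict[int, List[Row]]:
    """Attach title/overview to each rating and group rows per user, oldest first."""
    per_user: Dict[int, List[Row]] = defaultdict(list)
    for row in sorted(ratings, key=lambda r: r["timestamp"]):
        meta = metadata.get(row["movie_id"], {})  # items without metadata get empty text
        per_user[row["user_id"]].append({
            **row,
            "title": meta.get("title", ""),
            "overview": meta.get("overview", ""),
        })
    return dict(per_user)


# Toy data, invented for this example.
ratings = [
    {"user_id": 1, "movie_id": 20, "rating": 4, "timestamp": 200},
    {"user_id": 1, "movie_id": 10, "rating": 5, "timestamp": 100},
    {"user_id": 2, "movie_id": 30, "rating": 2, "timestamp": 150},
]
metadata = {
    10: {"title": "Movie A", "overview": "A heist goes wrong."},
    20: {"title": "Movie B", "overview": "Explorers cross a wormhole."},
}
histories = enrich_and_group(ratings, metadata)
print(histories[1])
```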

System Modules

Profile Generator

Reads a batch of interactions and updates the user's textual behavioral summary

Model or implementation: GPT-4o-mini or Gemma (via Ollama)

Rating Simulator

Predicts a rating for a new item based on the generated profile

Model or implementation: GPT-4o-mini or Gemma (via Ollama)
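
Because the Rating Simulator must return an integer from 1 to 5, free-text model replies need to be validated. One possible way to do this is sketched below; `parse_rating` is a hypothetical helper, and the summary does not say how the paper parses model output.

```python
import re
from typing import Optional


def parse_rating(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in [low, high] found in the reply, or None."""
    for token in re.findall(r"\d+(?:\.\d+)?", reply):
        if "." in token:
            continue  # decimals such as 4.5 are not valid integer ratings
        value = int(token)
        if low <= value <= high:
            return value
    return None  # caller decides whether to retry or skip this item


print(parse_rating("Rating: 4. The user enjoys sci-fi."))
```

Returning `None` instead of a default value keeps unparseable replies from silently biasing the simulated ratings.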

Novel Architectural Elements

Sequential batch-based profile updating: The user state is not static but a function of the previous state's textual summary plus the newest batch of interactions
Hybrid input reliance: Discards older history (keeping only the last 40 interactions) to force the model to rely on recent semantic signals rather than long-term collaborative patterns

Modeling

Base Model: GPT-4o-mini (OpenAI API), Gemma:3B, Gemma:12B (via Ollama)

Key Hyperparameters:

history_length: Last 40 interactions
batch_size: 10 interactions per profile update
output_range: Integer 1-5
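
To make the interplay of these settings concrete, the sketch below computes how many profile updates a user triggers. `SimConfig` and `profile_updates` are hypothetical names; only the values 40, 10, and 1-5 come from the summary above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    history_length: int = 40  # most recent interactions kept per user
    batch_size: int = 10      # interactions per profile update
    min_rating: int = 1
    max_rating: int = 5

    def profile_updates(self, n_interactions: int) -> int:
        """Number of profile-update calls for a user with n_interactions."""
        used = min(n_interactions, self.history_length)  # older history is discarded
        return math.ceil(used / self.batch_size)


cfg = SimConfig()
print(cfg.profile_updates(130), cfg.profile_updates(7))
```

With these defaults, a user with a long history triggers four updates (40 / 10), while a user with fewer than 10 interactions triggers a single update on a partial batch.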

Compute: Inference only. Supports local execution via Ollama for scalability.

Comparison to Prior Work

vs. RecSim/RecoGym: Lusifer uses natural language to model user state, allowing for zero-shot generalization to new items via metadata
vs. SVD++/NCF (Baselines): Lusifer is a *simulator* rather than just a predictor; it explains *why* a rating is given, whereas baselines only output a score
vs. Agent4Rec [not cited in paper]: Agent4Rec uses generative agents for simulation but often processes full history; Lusifer emphasizes incremental updates on recent history

Limitations

Lower predictive accuracy (RMSE) compared to traditional collaborative filtering on general benchmarks
Reliance on LLM API costs or local inference latency for large-scale simulations
Experiments limited to integer ratings (1-5) without decimals, potentially reducing granularity
Evaluation restricted to MovieLens dataset; multi-domain applicability not empirically tested

Reproducibility

Code: https://github.com/danialebrat/Lusifer

📊 Experiments & Results

Evaluation Setup

Predicting ratings for held-out test items using simulated user profiles derived from the last 40 training interactions

Benchmarks:

MovieLens 100K (Rating Prediction / User Simulation)
MovieLens 1M (Rating Prediction / User Simulation)

Metrics:

RMSE (Root Mean Squared Error)
MAE (Mean Absolute Error)
Pearson Correlation
Statistical methodology: 5-fold cross-validation
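
For reference, the three metrics can be computed as in the following dependency-free sketch. The function names and the toy ratings are invented for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root of the mean squared error; large errors are penalized more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson correlation; NaN if either series is constant."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    vt = sum((t - mt) ** 2 for t in y_true)
    vp = sum((p - mp) ** 2 for p in y_pred)
    if vt == 0 or vp == 0:
        return float("nan")  # correlation is undefined without variance
    return cov / math.sqrt(vt * vp)


truth = [5, 3, 4, 1]      # toy ground-truth ratings
simulated = [4, 3, 5, 2]  # toy simulated ratings
print(rmse(truth, simulated), mae(truth, simulated), pearson(truth, simulated))
```

The constant-series guard matters for LLM simulators in particular, since a model that always answers with the same rating would otherwise cause a division by zero.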

Key Results

General accuracy comparison on MovieLens 100K shows traditional baselines outperform LLM-based simulation in standard predictive accuracy.

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens 100K	RMSE	1.05	1.57	+0.52
MovieLens 100K	RMSE	1.05	1.19	+0.14

Cold start scenarios (users with <10 interactions) show Lusifer outperforming several neural and matrix factorization baselines.

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens 100K	RMSE	1.29	1.18	-0.11
MovieLens 100K	RMSE	1.35	1.18	-0.17
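
The Δ column reads as This Paper minus Baseline; since RMSE is an error measure, a negative Δ favors Lusifer. The short check below recomputes the column from the reported values. The row labels `general` and `cold_start` are added here for readability and do not appear in the table.

```python
# (setting, baseline RMSE, Lusifer RMSE) as reported in the table above
rows = [
    ("general", 1.05, 1.57),
    ("general", 1.05, 1.19),
    ("cold_start", 1.29, 1.18),
    ("cold_start", 1.35, 1.18),
]

# Round to two decimals to match the precision of the reported figures.
deltas = [round(ours - base, 2) for _, base, ours in rows]
lusifer_better = [d < 0 for d in deltas]
print(deltas, lusifer_better)
```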

Main Takeaways

Lusifer excels in cold-start scenarios where interaction data is sparse, leveraging textual metadata (movie overviews) to infer preferences where collaborative filtering fails
Including explicit numeric ratings in the LLM prompt context sometimes *reduced* accuracy compared to relying on textual descriptions, suggesting LLMs struggle with numerical regression reasoning
While not state-of-the-art in general prediction accuracy, Lusifer successfully generates *explainable* updates, making it a distinct tool for debugging and interpreting RL agent policies
Open-source models (Gemma:12B) surprisingly outperformed GPT-4o-mini in several rating prediction tasks within this framework

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems (Collaborative Filtering)
Large Language Models (Prompt Engineering)
Reinforcement Learning context

Key Terms

Cold Start: The scenario where the system has little or no prior interaction data for a user or item, making prediction difficult
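
As an illustration of the cold-start criterion used in the evaluation (users with fewer than 10 interactions), such users could be selected as sketched below. `cold_start_users` and the `user_id` field are hypothetical, and the sketch assumes that counts are taken over the training split.

```python
from collections import Counter
from typing import Dict, List, Set


def cold_start_users(ratings: List[Dict[str, int]], threshold: int = 10) -> Set[int]:
    """Return ids of users with fewer than `threshold` recorded interactions."""
    counts = Counter(r["user_id"] for r in ratings)
    return {user for user, n in counts.items() if n < threshold}


# Toy data: user 1 has 12 interactions, user 2 has 3.
ratings = [{"user_id": 1, "movie_id": m} for m in range(12)]
ratings += [{"user_id": 2, "movie_id": m} for m in range(3)]
cold = cold_start_users(ratings)
print(cold)
```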

RMSE: Root Mean Squared Error (the square root of the mean squared prediction error; a standard accuracy metric that penalizes large errors more heavily than MAE)

SVD++: Singular Value Decomposition++—a matrix factorization algorithm that incorporates implicit feedback

NCF: Neural Collaborative Filtering—a deep learning framework replacing the dot product of matrix factorization with a neural architecture

LLM: Large Language Model—generative AI models like GPT-4 or Gemma used here to simulate human reasoning

RL: Reinforcement Learning—training agents to make sequences of decisions (recommendations) to maximize cumulative reward

One-shot prompting: Providing the LLM with a single example of the desired input-output pair within the prompt to guide its generation
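
A one-shot prompt for rating generation might look like the sketch below. The wording, the worked example, and the function name `one_shot_prompt` are invented for illustration and are not the prompt used in the paper.

```python
def one_shot_prompt(profile: str, item_overview: str) -> str:
    """Build a prompt containing exactly one worked example before the real query."""
    example = (
        "Profile: Enjoys fast-paced heist films; dislikes slow dramas.\n"
        "Movie: A crew of thieves plans a casino robbery.\n"
        "Rating: 5\n"
    )
    return (
        "Predict the user's rating as an integer from 1 to 5.\n\n"
        "Example:\n" + example + "\n"
        f"Profile: {profile}\n"
        f"Movie: {item_overview}\n"
        "Rating:"
    )


prompt = one_shot_prompt(
    "Recently shifted from action films toward science fiction.",
    "Explorers travel through a wormhole in search of a new home.",
)
print(prompt)
```

Ending the prompt on `Rating:` nudges the model to answer with the number first, which makes the reply easier to validate.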