Evaluation Setup
Predict ratings for held-out test items using simulated user profiles built from each user's last 40 training interactions
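One way to picture the profile construction described above is the minimal sketch below. It assumes that "last 40" means the most recent interactions by timestamp and that each interaction is a `(user_id, item_id, rating, timestamp)` tuple. The function name `build_profiles` and the tuple layout are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def build_profiles(interactions, window=40):
    """Group training interactions by user and keep the most recent `window`.

    `interactions` is an iterable of (user_id, item_id, rating, timestamp)
    tuples. This layout is an assumption made for illustration.
    """
    by_user = defaultdict(list)
    for user_id, item_id, rating, timestamp in interactions:
        by_user[user_id].append((timestamp, item_id, rating))

    profiles = {}
    for user_id, events in by_user.items():
        events.sort(key=lambda e: e[0])  # oldest first
        # Keep only the tail of the history; users with fewer than
        # `window` interactions keep everything they have.
        profiles[user_id] = [(item, rating) for _, item, rating in events[-window:]]
    return profiles
```

Users with fewer interactions than the window simply get a shorter profile, which is the situation the cold-start results concern.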
Benchmarks:
- MovieLens 100K (Rating Prediction / User Simulation)
- MovieLens 1M (Rating Prediction / User Simulation)
Metrics:
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- Pearson Correlation
Statistical methodology:
- 5-fold cross-validation
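For reference, the three metrics and a 5-fold split can be sketched in plain Python as below. These are the standard textbook definitions. The fold construction (contiguous, unshuffled folds) is an assumption, since the summary does not say how the folds were drawn.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(y_true, y_pred):
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        return float("nan")
    return cov / math.sqrt(var_t * var_p)

def k_fold_indices(n, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size
```

RMSE and MAE are error measures, so lower is better. Pearson correlation measures how well predicted ratings track the true ordering, so higher is better.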
Key Results
General accuracy comparison on MovieLens 100K shows traditional baselines outperform LLM-based simulation in standard predictive accuracy.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MovieLens 100K | RMSE | 1.05 | 1.57 | +0.52 |
| MovieLens 100K | RMSE | 1.05 | 1.19 | +0.14 |

Cold start scenarios (users with <10 interactions) show Lusifer outperforming several neural and matrix factorization baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MovieLens 100K | RMSE | 1.29 | 1.18 | -0.11 |
| MovieLens 100K | RMSE | 1.35 | 1.18 | -0.17 |
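The reported Δ values are consistent with Δ = (This Paper RMSE) minus (Baseline RMSE), so a negative Δ means lower error than the baseline. The short check below recomputes them from the reported numbers. The row labels and the helper name `rmse_delta` are illustrative assumptions.

```python
# (setting, baseline RMSE, this-paper RMSE) copied from the reported results
rows = [
    ("general", 1.05, 1.57),
    ("general", 1.05, 1.19),
    ("cold start", 1.29, 1.18),
    ("cold start", 1.35, 1.18),
]

def rmse_delta(baseline, ours):
    """Return ours - baseline rounded to two decimals; negative means lower error."""
    return round(ours - baseline, 2)

deltas = [rmse_delta(b, o) for _, b, o in rows]
for (setting, b, o), d in zip(rows, deltas):
    verdict = "improves on baseline" if d < 0 else "does not improve on baseline"
    print(f"{setting}: {b:.2f} -> {o:.2f} (delta {d:+.2f}, {verdict})")
```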
Main Takeaways
- Lusifer excels in cold-start scenarios where interaction data is sparse, leveraging textual metadata (movie overviews) to infer preferences where collaborative filtering fails
- Including explicit numeric ratings in the LLM prompt context sometimes *reduced* accuracy compared to relying on textual descriptions, suggesting LLMs struggle with numerical regression reasoning
- While not state-of-the-art in general prediction accuracy, Lusifer successfully generates *explainable* updates, making it a distinct tool for debugging and interpreting RL agent policies
- Open-source models (Gemma:12B) surprisingly outperformed GPT-4o-mini in several rating prediction tasks within this framework