Evaluation Setup
Simulation of 1,000 agents interacting with recommenders initialized from MovieLens-1M, Steam, and Amazon-Book datasets.
Benchmarks:
- MovieLens-1M (Movie Recommendation)
- Steam (Game Recommendation)
- Amazon-Book (Book Recommendation)
Metrics:
- MAE (Mean Absolute Error) of rating prediction
- MSE (Mean Squared Error) of rating prediction
- RMSE (Root Mean Squared Error) of rating prediction
- Correlation (Spearman/Pearson) between agent and human rating distributions
- Statistical methodology: Spearman and Pearson correlation coefficients are reported as the measure of alignment between agent and human ratings.
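The metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's evaluation code: the function names, the 1-5 star scale in `rating_distribution`, and the toy `human` and `agent` ratings are all assumptions made for the example.

```python
import math
from collections import Counter


def mae(pred, true):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mse(pred, true):
    """Mean squared error between predicted and true ratings."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def rmse(pred, true):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(pred, true))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rating_distribution(ratings, levels=(1, 2, 3, 4, 5)):
    """Share of ratings at each star level (assumed 1-5 scale)."""
    counts = Counter(ratings)
    return [counts[level] / len(ratings) for level in levels]


# Toy data (hypothetical): ratings for the same items from humans and agents.
human = [5, 3, 4, 1, 2, 4, 4, 3]
agent = [4, 3, 5, 2, 2, 4, 3, 3]

print("MAE :", mae(agent, human))
print("MSE :", mse(agent, human))
print("RMSE:", rmse(agent, human))
print("Pearson  (distributions):",
      pearson(rating_distribution(agent), rating_distribution(human)))
print("Spearman (distributions):",
      spearman(rating_distribution(agent), rating_distribution(human)))
```

The error metrics compare ratings item by item, whereas the correlations here compare the two rating histograms, which is one plausible reading of "between agent and human rating distributions".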
Key Results
Rating alignment experiments demonstrate that Agent4Rec agents can reproduce the rating patterns of real users with reasonable accuracy.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MovieLens-1M | MAE | 1.187 | 0.768 | -0.419 |
| MovieLens-1M | Spearman Correlation (Rating Distribution) | 0.0 | 0.638 | +0.638 |

Recommender evaluation via simulation shows that Neural Graph approaches generally perform best, aligning with offline expectations.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MovieLens-1M (Simulation) | Average Rating | 3.31 | 4.12 | +0.81 |
| MovieLens-1M (Simulation) | Engagement (Pages Viewed) | 2.44 | 4.21 | +1.77 |
Main Takeaways
- Agents exhibit realistic rating behaviors: The rating distribution generated by agents closely mirrors the ground truth (e.g., matching the Gaussian-like distribution of MovieLens ratings).
- Filter Bubble confirmation: In simulation, high-performing algorithms like MF and LightGCN tend to reduce the diversity of genres shown to users over time compared to random or popularity-based baselines.
- Causal Discovery: The simulator data allows for the recovery of causal relationships (e.g., Activity -> Click Count), providing a new way to validate the logic of recommendation data generation.
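One way to make the filter-bubble takeaway concrete is to track the genre diversity of each recommended page over time. The sketch below uses Shannon entropy over genre labels as the diversity measure. This is an assumption chosen for illustration and may differ from the measure used in the paper; the `genre_entropy` helper and the three toy pages are hypothetical.

```python
import math
from collections import Counter


def genre_entropy(genres):
    """Shannon entropy (in bits) of the genre labels on one recommended page.

    Higher values mean the page spreads over more genres more evenly;
    0 means every item on the page has the same genre.
    """
    counts = Counter(genres)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy example (hypothetical): genres shown on three successive pages.
pages = [
    ["Action", "Comedy", "Drama", "Horror"],
    ["Action", "Action", "Comedy", "Drama"],
    ["Action", "Action", "Action", "Comedy"],
]

entropy_per_page = [genre_entropy(page) for page in pages]
for step, value in enumerate(entropy_per_page, start=1):
    print(f"page {step}: genre entropy = {value:.3f} bits")
```

A sequence of entropies that falls from page to page, as in this toy data, is the pattern the takeaway describes for MF and LightGCN; a random or popularity-based baseline would be expected to show a flatter curve.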