LLM-as-a-Judge: Toward World Models for Slate Recommendation Systems

Baptiste Bonin, Maxime Heuillet, Audrey Durand
Université Laval, Mila - Quebec AI Institute
arXiv (2025)
Recommendation P13N Benchmark

📝 Paper Summary

Offline Evaluation User Simulation / World Models
Large Language Models can effectively simulate user preferences for slate recommendations by acting as pairwise judges that predict which of two ordered item lists a user would prefer.
Core Problem
Offline evaluation of slate recommenders is difficult because historical logs have limited coverage, and existing simulators often focus on item-level metrics rather than the utility of ordered sequences (slates).
Why it matters:
  • Recommender systems increasingly suggest ordered sequences (slates) like playlists or news feeds, where order impacts utility
  • Evaluating unseen slates without live users is challenging; current item-wise metrics fail to capture the holistic value of a slate
  • Existing simulators model click dynamics but lack a high-level framework for reasoning about entire slates
Concrete Example: A user might like song A and song B individually, yet prefer a playlist (slate) that starts with A over one that starts with B because of mood flow. An item-wise evaluator gives both playlists the same score, so it cannot capture the user's preference for a specific ordering.
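The limitation in the example above can be illustrated with a minimal sketch. The `itemwise_score` function and the toy `liked` set are hypothetical, introduced only for illustration; they are not the metric used in the paper.

```python
def itemwise_score(slate, liked):
    """Fraction of slate items the user likes; position is ignored."""
    return sum(item in liked for item in slate) / len(slate)


# Hypothetical user who likes both songs individually.
liked = {"A", "B"}

slate_ab = ["A", "B"]  # playlist starting with A
slate_ba = ["B", "A"]  # same songs, starting with B

score_ab = itemwise_score(slate_ab, liked)
score_ba = itemwise_score(slate_ba, liked)

# Both orderings receive the same score, so this metric cannot
# express a preference for one ordering over the other.
scores_equal = score_ab == score_ba
```

Any metric that depends only on the set of items in the slate has this property, which is the motivation for judging whole slates instead.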
Key Novelty
Evaluator-Centric World Models via Pairwise Slate Comparison
  • Frames the simulation of user utility not as predicting absolute ratings, but as a pairwise classification task where an LLM judges which of two candidate slates is better
  • Introduces a coherence validation protocol that checks if the LLM's preferences satisfy logical axioms like transitivity (if A>B and B>C, then A>C) and asymmetry
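A coherence check of this kind could be sketched as follows. This is not the paper's protocol: the `judge` callable is a stand-in for an LLM call, assumed here to be a forced choice that returns `True` when the first slate shown is preferred, and the function names and rate definitions are illustrative assumptions.

```python
from itertools import combinations, permutations


def asymmetry_rate(judge, slates):
    """Share of slate pairs whose verdict survives swapping the
    presentation order (A beats B exactly when B does not beat A)."""
    pairs = list(combinations(slates, 2))
    consistent = sum(judge(a, b) != judge(b, a) for a, b in pairs)
    return consistent / len(pairs)


def transitivity_rate(judge, slates):
    """Among ordered triples with A > B and B > C, the share
    where the judge also says A > C."""
    checked = satisfied = 0
    for a, b, c in permutations(slates, 3):
        if judge(a, b) and judge(b, c):
            checked += 1
            satisfied += int(judge(a, c))
    # No qualifying triple means no violation was observed.
    return satisfied / checked if checked else 1.0


# Toy stand-ins for an LLM judge (assumptions for illustration only).
def sum_judge(a, b):
    """Coherent judge: prefers the slate with the larger total."""
    return sum(a) > sum(b)


def first_slot_judge(a, b):
    """Position-biased judge: always picks whichever slate is shown first."""
    return True


toy_slates = [(1, 2), (3, 4), (5, 6), (7, 9)]
```

With these toy inputs, `sum_judge` is fully asymmetric and transitive, while `first_slot_judge` fails every asymmetry check because its verdict flips whenever the presentation order is swapped.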
Evaluation Highlights
  • Lower empirical regret correlates with higher logical coherence (transitivity/asymmetry), suggesting that LLMs with more consistent preferences better approximate user utility
  • LLMs consistently outperform random baselines in slate recommendation tasks (Task 3), particularly where slate similarity is lower
  • Sequence ordering (Task 2) proves difficult for LLMs, with performance clustering near random when slates differ only by permutation
Breakthrough Assessment
7/10
Provides a solid methodological framework for using LLMs as slate evaluators. While performance on strict ordering is mixed, the correlation between coherence and regret is a valuable insight for building better world models.