← Back to Paper List

RecToM: A Benchmark for Evaluating Machine Theory of Mind in LLM-based Conversational Recommender Systems

Mengfan Li, Xuanhua Shi, Yang Deng
arXiv (2025)
Benchmark Recommendation Reasoning P13N

📝 Paper Summary

Conversational Recommender Systems (CRS) Theory of Mind (ToM) in LLMs Benchmark Construction
RecToM is a benchmark for assessing Theory of Mind in conversational recommender systems, evaluating whether LLMs can infer users' complex mental states and use them to predict effective dialogue strategies.
Core Problem
Existing Theory of Mind benchmarks rely on simplistic synthetic narratives (like Sally-Anne tests) or retrospective reasoning, failing to capture the complex, dynamic, and strategic nature of real-world recommendation dialogues.
Why it matters:
  • Effective recommendation requires understanding subtle user preferences and intentions, not just physical object tracking
  • Current benchmarks overlook 'Behavioral Prediction'—the ability to use inferred mental states to guide future actions, which is critical for proactive recommender systems
  • LLMs often generate sycophantic responses rather than grounded reasoning, leading to suboptimal user experiences in CRS
Concrete Example: A seeker might reject a movie but imply a latent preference (e.g., 'I hate this actor, but I love the genre'). A model failing ToM might miss the genre preference or fail to predict that the next best move is to recommend a different movie in the same genre, instead just apologizing vacantly.
Key Novelty
RecToM: Realistic CRS-Specific Theory of Mind Benchmark
  • Shifts evaluation from simple physical state tracking to complex psychological reasoning in asymmetric social roles (Recommender vs. Seeker)
  • Decomposes ToM into two complementary dimensions: Cognitive Inference (understanding 'what'—beliefs, desires, intentions) and Behavioral Prediction (understanding 'what next'—strategy selection and judgment)
  • Introduces multi-granular and multi-dimensional annotations, such as distinguishing between coarse/fine-grained intentions and analyzing beliefs across suggestion, seen, and liked dimensions
Architecture
Architecture Figure Figure 1
Overview of the RecToM benchmark framework, illustrating the two main dimensions: Cognitive Inference (Desire, Belief, Intention) and Behavioral Prediction (Prediction, Judgment).
Evaluation Highlights
  • LLMs exhibit significantly lower accuracy on multiple-choice questions compared to binary/single-choice, struggling as option complexity increases
  • Performance drops notably on fine-grained intention classification compared to coarse-grained tasks, indicating a bottleneck in modeling nuanced user goals
  • LLMs show a systematic bias towards 'sycophantic' responses, often agreeing with perceived preferences rather than making objective effectiveness judgments
Breakthrough Assessment
8/10
Significantly advances ToM evaluation by moving beyond toy problems to realistic, complex recommender scenarios. The focus on behavioral prediction bridges the gap between understanding and acting.
×