RecToM: A Benchmark for Evaluating Machine Theory of Mind in LLM-based Conversational Recommender Systems

📝 Paper Summary

Conversational Recommender Systems (CRS)
Theory of Mind (ToM) in LLMs
Benchmark Construction

RecToM is a benchmark for assessing Theory of Mind in conversational recommender systems, evaluating whether LLMs can infer users' complex mental states and use them to predict effective dialogue strategies.

Core Problem

Existing Theory of Mind benchmarks rely on simplistic synthetic narratives (like Sally-Anne tests) or retrospective reasoning, failing to capture the complex, dynamic, and strategic nature of real-world recommendation dialogues.

Why it matters:

Effective recommendation requires understanding subtle user preferences and intentions, not just physical object tracking
Current benchmarks overlook 'Behavioral Prediction'—the ability to use inferred mental states to guide future actions, which is critical for proactive recommender systems
LLMs often generate sycophantic responses rather than grounded reasoning, leading to suboptimal user experiences in CRS

Concrete Example: A seeker might reject a movie but imply a latent preference (e.g., 'I hate this actor, but I love the genre'). A model lacking ToM might miss the genre preference, or fail to predict that the best next move is to recommend a different movie in the same genre, and simply offer an empty apology instead.

Key Novelty

RecToM: Realistic CRS-Specific Theory of Mind Benchmark

Shifts evaluation from simple physical state tracking to complex psychological reasoning in asymmetric social roles (Recommender vs. Seeker)
Decomposes ToM into two complementary dimensions: Cognitive Inference (understanding 'what'—beliefs, desires, intentions) and Behavioral Prediction (understanding 'what next'—strategy selection and judgment)
Introduces multi-granular and multi-dimensional annotations, such as distinguishing between coarse/fine-grained intentions and analyzing beliefs across suggestion, seen, and liked dimensions

Architecture

Overview of the RecToM benchmark framework, illustrating the two main dimensions: Cognitive Inference (Desire, Belief, Intention) and Behavioral Prediction (Prediction, Judgment).

Evaluation Highlights

LLMs are markedly less accurate on multiple-choice questions than on binary or single-choice ones, and accuracy degrades as option complexity increases
Performance drops notably on fine-grained intention classification compared to coarse-grained tasks, indicating a bottleneck in modeling nuanced user goals
LLMs show a systematic bias towards 'sycophantic' responses, often agreeing with perceived preferences rather than making objective effectiveness judgments

Breakthrough Assessment

8/10

Significantly advances ToM evaluation by moving beyond toy problems to realistic, complex recommender scenarios. The focus on behavioral prediction bridges the gap between understanding and acting.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn conversational recommendation dialogue analysis via Question Answering (QA)

Inputs: Dialogue history between a Recommender and a Seeker

Outputs: Selected option (from multiple choices) representing inferred mental states (intention, belief, desire) or predicted behaviors (strategy, effectiveness)
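The input/output contract above can be pictured as a small record type. The following is a minimal sketch, not the benchmark's actual data format: the class names, field names, and the example dialogue are all hypothetical, and the gold answer is stored as a set of option indices so that questions with one or several correct options share one representation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Turn:
    speaker: str  # "Recommender" or "Seeker"
    text: str


@dataclass(frozen=True)
class ToMQuestion:
    """One QA item (hypothetical layout, not the released schema)."""
    dialogue: Tuple[Turn, ...]  # dialogue history shown to the model
    question: str
    options: Tuple[str, ...]
    gold: FrozenSet[int]        # indices of the correct option(s)
    category: str               # e.g. "intention", "belief", "strategy"

    def is_correct(self, predicted: Iterable[int]) -> bool:
        # Correct only if the predicted option set equals the gold set.
        return frozenset(predicted) == self.gold


# Illustrative item built from the genre-preference example in this summary.
example = ToMQuestion(
    dialogue=(
        Turn("Recommender", "Have you seen this thriller?"),
        Turn("Seeker", "I hate this actor, but I love the genre."),
    ),
    question="Which strategy should the Recommender use next?",
    options=(
        "Apologize and end the conversation",
        "Recommend a different movie in the same genre",
        "Recommend the same movie again",
    ),
    gold=frozenset({1}),
    category="strategy",
)
```

Calling `example.is_correct([1])` returns `True`, while any other selection, including a superset such as `[0, 1]`, is scored as incorrect.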

Pipeline Flow

Data Selection (ReDial dataset filtering)
Human Annotation (Belief, Desire, Intention)
Question Generation (Templates for 10 question types)
LLM Evaluation (QA Task)
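The question-generation step can be sketched with simple slot-filling templates. This is only an illustration of the idea: the summary says templates exist for 10 question types but does not give their wording, so the two templates and the `build_question` helper below are hypothetical.

```python
from string import Template

# Hypothetical templates; the benchmark's real wording is not given here.
QUESTION_TEMPLATES = {
    "belief_seen": Template('Has the Seeker seen the movie "$item"?'),
    "seeker_intention": Template("What is the Seeker's intention in turn $turn?"),
}


def build_question(question_type: str, **slots: str) -> str:
    """Fill the template for one question type from annotated slot values.

    Template.substitute raises KeyError if a required slot is missing,
    so incomplete annotations fail loudly instead of producing a
    half-filled question.
    """
    return QUESTION_TEMPLATES[question_type].substitute(**slots)
```

For instance, `build_question("belief_seen", item="Up")` produces `Has the Seeker seen the movie "Up"?`.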

System Modules

Data Selector (Data Construction)

Filter ReDial dataset for high-quality dialogues with clear acceptance/rejection signals

Model or implementation: Rule-based filtering following IARD protocol

Annotator (Data Construction)

Label dialogues with mental states

Model or implementation: Human Experts (PhD students)

Evaluator

Query LLMs with constructed QA pairs to test ToM capabilities

Model or implementation: Various LLMs (e.g., GPT-4, Llama-3)
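One way the Evaluator step could be wired up is shown below: a function that renders a dialogue and lettered options into a prompt, and a parser that maps the model's reply back to option indices. The prompt wording, the `Answer:` convention, and both function names are assumptions made for illustration; the actual prompts used in the paper are not reproduced in this summary, and the call to the LLM itself is left out.

```python
import re
from typing import Sequence, Set, Tuple

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def format_prompt(
    dialogue: Sequence[Tuple[str, str]],  # (speaker, utterance) pairs
    question: str,
    options: Sequence[str],
) -> str:
    """Render one QA item as a plain-text prompt (hypothetical wording)."""
    if len(options) > len(LETTERS):
        raise ValueError("too many options to letter")
    lines = ["Dialogue:"]
    lines += [f"{speaker}: {text}" for speaker, text in dialogue]
    lines += ["", f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {option}" for i, option in enumerate(options)]
    lines += ["", "Reply in the form 'Answer: <letter(s)>'."]
    return "\n".join(lines)


def parse_answer(reply: str, n_options: int) -> Set[int]:
    """Extract selected option indices from a reply.

    Only the text after the last 'Answer:' is inspected (the whole reply
    if that marker is absent), and only standalone capital letters that
    name a valid option are kept. This assumes a terse answer; free-form
    replies would need a more careful parser.
    """
    tail = reply.rsplit("Answer:", 1)[-1]
    valid = LETTERS[:n_options]
    return {valid.index(ch) for ch in re.findall(r"\b[A-Z]\b", tail) if ch in valid}
```

Returning a set keeps single-answer and multi-answer questions on the same code path, and an out-of-range letter simply yields an empty selection rather than an exception.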

Novel Architectural Elements

Hierarchical Intention Schema: Distinguishes between coarse-grained (high-level purpose) and fine-grained (context-dependent sub-intentions) categories
Multi-dimensional Belief Modeling: Decomposes item belief into 'Suggestion' source, 'Seen' status, and 'Liked' status
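The two elements above lend themselves to a compact data model. The sketch below is an assumption-laden illustration rather than the benchmark's schema: `ItemBelief`, `Role`, and `grade_intention` are hypothetical names, and apart from the 'ask' / 'ask for preference' pair mentioned in this summary, the intention labels are invented placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Role(Enum):
    RECOMMENDER = "recommender"
    SEEKER = "seeker"


@dataclass(frozen=True)
class ItemBelief:
    """Belief about one item along the three dimensions named above."""
    item: str
    suggested_by: Role     # 'Suggestion' source
    seen: Optional[bool]   # 'Seen' status; None if the dialogue does not say
    liked: Optional[bool]  # 'Liked' status; None if the dialogue does not say


# Fine-grained label -> coarse-grained parent.
FINE_TO_COARSE: Dict[str, str] = {
    "ask for preference": "ask",         # pair mentioned in this summary
    "ask for opinion": "ask",            # hypothetical label
    "accept recommendation": "respond",  # hypothetical label
}


def grade_intention(pred_fine: str, gold_fine: str) -> Dict[str, bool]:
    """Score one prediction at both granularities."""
    return {
        "fine": pred_fine == gold_fine,
        "coarse": FINE_TO_COARSE[pred_fine] == FINE_TO_COARSE[gold_fine],
    }
```

Grading at both levels makes the reported bottleneck visible: a prediction of "ask for opinion" against a gold label of "ask for preference" is right at the coarse level but wrong at the fine level.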

Comparison to Prior Work

vs. OpenToM: RecToM focuses specifically on Recommendation Dialogues (CRS) rather than general narratives
vs. Hi-ToM: RecToM integrates Behavioral Prediction (action selection) alongside mental state inference
vs. SocialIQA: RecToM models domain-specific CRS mental states like item preference evolution and recommender strategy [not cited in paper]

Limitations

Reliance on a single base dataset (ReDial) which focuses only on movie recommendations
Manual annotation is resource-intensive, limiting the scale of the benchmark (336 dialogues)
Evaluation is currently limited to multiple-choice QA, which may not fully capture open-ended generation capabilities
Focuses on text-only modalities, excluding potential vocal or visual cues in human interaction

Reproducibility

Code: https://github.com/CGCL-codes/RecToM

Benchmark data and code are publicly available at https://github.com/CGCL-codes/RecToM. The data are derived from the public ReDial dataset, and the human annotation process and guidelines are described in detail.

📊 Experiments & Results

Evaluation Setup

Multiple-choice Question Answering on RecToM benchmark

Benchmarks:

RecToM (Mental State Inference & Behavioral Prediction in CRS) [New]

Metrics:

Accuracy
Statistical methodology: Inter-Annotator Agreement (Fleiss's Kappa) reported for data quality.
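Accuracy broken down by question format could be computed along the following lines. This is a sketch under one stated assumption: a multi-answer question counts as correct only when the predicted option set matches the gold set exactly. The summary does not specify the paper's scoring rule for such questions, and the function name and record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

Record = Tuple[Hashable, Set[int], Set[int]]  # (group, predicted, gold)


def accuracy_by_group(records: Iterable[Record]) -> Dict[Hashable, float]:
    """Exact-match accuracy per group (e.g. per question format or type)."""
    hits: Dict[Hashable, int] = defaultdict(int)
    totals: Dict[Hashable, int] = defaultdict(int)
    for group, predicted, gold in records:
        totals[group] += 1
        # Assumed rule: partial overlap earns no credit.
        hits[group] += int(set(predicted) == set(gold))
    return {group: hits[group] / totals[group] for group in totals}
```

Under exact matching, the chance of guessing correctly shrinks quickly as the number of options and correct answers grows, which is one plausible reading of why multi-answer formats are harder than binary ones.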

Key Results

Initial experiments reveal that RecToM is a challenging benchmark, with significant room for improvement even for advanced models.

Benchmark	Metric	Baseline	This Paper	Δ
RecToM	Inter-Annotator Agreement (Fleiss's Kappa)	0.0	0.79	+0.79

Experiment Figures

Hierarchical categorization of intentions for Recommender and Seeker.

Main Takeaways

Increased option complexity significantly hinders ToM reasoning; LLMs struggle more with multiple-choice questions than binary ones.
Fine-grained intention discrimination is a bottleneck; models are better at identifying broad goals (e.g., 'ask') than specific nuances (e.g., 'ask for preference').
Models show early potential for multi-dimensional reasoning but struggle to integrate all cues (who suggested, seen status, liked status) consistently.
A strong 'sycophancy' bias exists where models prefer agreeable responses over objectively effective strategic judgments.
Chain-of-Thought (CoT) prompting yields limited or even negative benefits in this complex CRS setting, suggesting current prompting strategies are insufficient for dynamic social reasoning.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Conversational Recommender Systems (CRS)
Familiarity with Theory of Mind (ToM) concepts (Belief-Desire-Intention model)
Basic knowledge of LLM evaluation via Question Answering

Key Terms

ToM: Theory of Mind—the cognitive capacity to attribute mental states (beliefs, intents, desires, knowledge) to oneself and others

CRS: Conversational Recommender Systems—interactive systems that elicit user preferences and make recommendations through natural language dialogue

Sally-Anne test: A classic psychological test used to assess false-belief understanding, often involving scenarios where an object is moved while a character is absent

BDI model: Belief-Desire-Intention model—a software model developed for programming intelligent agents based on human psychology

Sycophancy: The tendency of models to produce responses that align with the user's view or are overly agreeable, even if factually incorrect or unhelpful

CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer

IAA: Inter-Annotator Agreement—a measure of how well multiple human annotators agree on labels (e.g., Fleiss's Kappa)
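For readers unfamiliar with Fleiss's Kappa, the standard definition can be written in a few lines. The function below implements the textbook formula (observed agreement minus chance agreement, normalized by the maximum possible improvement over chance); it is a generic sketch, not the authors' annotation-analysis code, and the input layout is an assumption.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss's Kappa for a table of rating counts.

    counts[i][j] is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number n >= 2 of
    annotators. The result is undefined (division by zero) when all
    ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of annotator pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_chance = sum(p * p for p in p_cat)
    return (p_observed - p_chance) / (1 - p_chance)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is how the 0.79 reported for RecToM should be read.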