Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent

📝 Paper Summary

Conversational Recommendation Systems (CRS), Preference Optimization (PO), User Simulation

ECPO aligns conversational agents with user expectations by explicitly modeling satisfaction via Expectation Confirmation Theory to generate turn-level preference pairs without costly sampling.

Core Problem

Current conversational agents often generate short-sighted responses that fail to sustain long-term guidance, and existing multi-turn preference optimization methods are inefficient due to high sampling costs and noisy intermediate rewards.

Why it matters:

Standard LLMs prioritize next-token prediction over long-term strategic guidance needed for recommendation
Existing approaches like MCTS-based optimization require expensive self-sampling to estimate turn-level rewards, which is computationally prohibitive
Randomness in simulated environments introduces noise into preference labels, degrading the alignment of the agent

Concrete Example: In a recommendation dialogue, a standard agent might prematurely recommend an item before fully understanding preferences. ECPO identifies this 'short-sightedness' by comparing the response against user expectations (flexibility/coherence), triggering a rewrite to ask a clarifying question instead.

Key Novelty

Expectation Confirmation Preference Optimization (ECPO)

Uses Expectation Confirmation Theory to simulate a user's inner monologue, assigning satisfaction scores to each turn based on flexibility, coherence, and guidance
Identifies low-satisfaction turns and uses a 'Backward Expectation Derivation' process to rewrite them, creating high-quality preference pairs (original vs. rewritten) without self-sampling
Introduces AILO, a user simulator based on Activities, Interests, Language, and Orientations to provide diverse feedback and perform the expectation confirmation process

Architecture

The complete ECPO pipeline: Simulator-Guided Planning Tuning, followed by the three-step optimization process (Forward Expectation Confirmation, Backward Expectation Derivation, Preference Optimization).

Evaluation Highlights

Outperforms DPO and KTO baselines in GPT-4-judged turn-level comparisons, achieving a 64.0% win rate on the Multi-WOZ dataset
AILO user simulator generates significantly more diverse user personas (lower ROUGE-L scores) compared to the RecAgent baseline
AILO achieves a 100% win rate over iEvalLM in human evaluations regarding the human-like quality of simulated dialogue

Breakthrough Assessment

7/10

Novel application of psychological theory (ECT) to eliminate sampling overhead in preference optimization. Strong efficiency gains, though limited by reliance on the quality of the internal simulator.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn conversational recommendation where an agent interacts with a user to elicit preferences and recommend items from a database

Inputs: Dialogue history h_t and user utterance u_t

Outputs: Agent response p_t and internal reasoning cr_t

Pipeline Flow

Forward Expectation Confirmation: Simulator evaluates agent response
Backward Expectation Derivation: Rewriter refines unsatisfactory responses
Preference Optimization: Agent trained on original/rewritten pairs
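
The three steps above can be sketched as a small data-construction loop. This is a minimal illustration, not the authors' implementation: the `Turn` and `PreferencePair` types, the `simulate` and `rewrite` callables, the 1-to-5 satisfaction scale, and the toy stand-ins for AILO and the Rewriter are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    history: str   # dialogue history h_t plus user utterance u_t
    response: str  # agent response p_t


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # rewritten response
    rejected: str  # original low-satisfaction response


# Hypothetical interfaces: the simulator returns (satisfaction, feedback);
# the rewriter maps (turn, feedback) to an improved response.
Simulator = Callable[[Turn], Tuple[float, str]]
Rewriter = Callable[[Turn, str], str]


def build_preference_pairs(
    turns: List[Turn], simulate: Simulator, rewrite: Rewriter, threshold: float
) -> List[PreferencePair]:
    pairs = []
    for turn in turns:
        # Step 1: Forward Expectation Confirmation (score the response).
        satisfaction, feedback = simulate(turn)
        if satisfaction >= threshold:
            continue  # satisfactory turns produce no pair
        # Step 2: Backward Expectation Derivation (rewrite using feedback).
        improved = rewrite(turn, feedback)
        pairs.append(PreferencePair(turn.history, improved, turn.response))
    # Step 3 (Preference Optimization) would train on the returned pairs.
    return pairs


def toy_simulator(turn: Turn) -> Tuple[float, str]:
    # Toy stand-in for the user simulator: penalize responses that
    # recommend immediately instead of asking a question.
    if "?" in turn.response:
        return 4.0, "The agent is exploring my preferences."
    return 1.0, "The agent recommended before asking what I like."


def toy_rewriter(turn: Turn, feedback: str) -> str:
    # Toy stand-in for the LLM rewriter; ignores its inputs.
    return "Before I suggest something, what genres do you usually enjoy?"


if __name__ == "__main__":
    demo = [
        Turn("User: I want a movie.", "Watch Inception."),
        Turn("User: I want a movie.", "Which genres do you enjoy?"),
    ]
    for pair in build_preference_pairs(demo, toy_simulator, toy_rewriter, 3.0):
        print(pair)
```

Only turns scored below the threshold yield a pair, so the number of extra LLM calls scales with the number of unsatisfactory turns rather than with a sampling budget per turn.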

System Modules

User Simulator (AILO)

Simulates user interactions and performs the Expectation Confirmation process to assign satisfaction scores

Model or implementation: GPT-4o (used for persona inference and simulation)

Rewriter

Refines agent responses that fall below a satisfaction threshold

Model or implementation: LLM (specific model not detailed, likely GPT-4o or similar high-capacity model)

Recommendation Agent (CRA)

Interacts with users to recommend items; the target of optimization

Model or implementation: Llama-3.1-8B-Instruct

Novel Architectural Elements

Integration of Expectation Confirmation Theory module directly into the training loop to generate rewards without external sampling
Backward Expectation Derivation mechanism that uses natural language feedback to guide the rewriting of negative samples

Modeling

Base Model: Llama-3.1-8B-Instruct

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to prefer rewritten responses over original unsatisfactory ones.

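Formally: the DPO loss to minimize is -log(sigmoid(beta * (log(pi(y_w|x)/pi_ref(y_w|x)) - log(pi(y_l|x)/pi_ref(y_l|x))))), averaged over preference pairs, where x is the dialogue context, y_w the rewritten (preferred) response, y_l the original response, pi the policy being trained, and pi_ref the frozen reference policy.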
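
The per-pair loss above can be evaluated directly from sequence log-probabilities. The sketch below is a plain-Python illustration under stated assumptions: the function name `dpo_loss` is hypothetical, and `beta=0.1` is a commonly used default rather than a value reported in the source.

```python
import math


def dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,  # assumed default; not specified in the source
) -> float:
    """Per-pair DPO loss from summed log-probabilities of each response.

    w = preferred (rewritten) response, l = dispreferred (original) response.
    """
    margin = beta * (
        (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both responses the margin is zero and the loss equals log 2; it falls below that value once the policy assigns relatively more probability to the rewritten response than the reference does.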

Adaptation: Full fine-tuning (implied)

Training Data:

Dataset D_sft generated via interactions between a GPT-4o-mini-based CRA and the user simulator
Dataset D_pre constructed by identifying turns where satisfaction < lambda and rewriting them

Key Hyperparameters:

lambda: Satisfaction threshold (hyperparameter mentioned but exact value not specified in text)

Compute: Not reported in the paper

Comparison to Prior Work

vs. DPO/KTO baselines: ECPO builds its preference pairs from ECT-based satisfaction scores computed in multi-turn context, whereas the baselines lack this explicit satisfaction modeling and often treat turns independently.
vs. MCTS-based methods (discussed in the paper's introduction): ECPO avoids the high sampling overhead of simulating full conversations to estimate intermediate rewards.
vs. RecAgent: AILO (ECPO's simulator) uses structured personas (Activities, Interests, Language, and Orientations), resulting in lower ROUGE-L scores (higher diversity) than RecAgent.

Limitations

Relies heavily on the capability of the user simulator (AILO) and the Rewriter (LLM) to correctly assess and improve responses.
The computational cost of the simulation phase (using GPT-4o) is not quantified and may still be significant, even though sampling during training is reduced.
Evaluation is primarily simulation-based; real-user evaluation is not reported.

Reproducibility

Code is not provided in the paper. The method relies on GPT-4o for simulation and data generation. Training dataset construction details are provided, but specific hyperparameters for DPO (learning rate, batch size) are not explicitly listed in the main text.

📊 Experiments & Results

Evaluation Setup

Turn-level evaluation using GPT-4 as a judge to compare win rates between the proposed method and baselines.

Benchmarks:

ReDial (Conversational Movie Recommendation)
Multi-WOZ (Task-oriented Dialogue: Restaurant/Hotel)
KuaiRec (Conversational Recommendation)

Metrics:

Win Rate (Turn-level)
Tie Rate
Lose Rate
Statistical methodology: Not explicitly reported in the paper
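
The turn-level win, tie, and lose rates listed under Metrics can be tallied from per-turn judge verdicts as sketched below. The verdict labels `"win"`, `"tie"`, and `"lose"` and the function name `outcome_rates` are assumptions for illustration; the source does not describe the judge's output format.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("win", "tie", "lose")


def outcome_rates(verdicts: List[str]) -> Dict[str, float]:
    """Return win/tie/lose percentages from a list of per-turn verdicts."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = len(verdicts)
    # Counter returns 0 for labels that never occur.
    return {label: 100.0 * counts[label] / total for label in LABELS}
```

Because ties are counted separately, a win rate should be read together with the tie and lose rates rather than against a fixed 50% reference.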

Key Results

ECPO demonstrates superior interactive capabilities across three datasets compared to SFT and other preference optimization methods.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ
ReDial	Win Rate	50.0	58.4	+8.4
Multi-WOZ	Win Rate	50.0	64.0	+14.0
KuaiRec	Win Rate	50.0	61.6	+11.6

Comparison against advanced preference optimization baselines (using SFT as anchor, implied).
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ
ReDial	Win Rate	54.0	58.4	+4.4

Human evaluation confirms the superiority of the AILO simulator.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ
Human Evaluation	Win Rate (Human-likeness)	0.0	100.0	+100.0

Experiment Figures

Distribution of ROUGE-L scores for user personas generated by AILO vs. RecAgent.
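
To make the diversity measure concrete, the sketch below computes a ROUGE-L score from the longest common subsequence (LCS) of two persona descriptions and averages it over all pairs in a set. It is an illustrative simplification: it tokenizes on whitespace and uses the balanced F1 form, and the function names are hypothetical; the source does not state which ROUGE-L variant or tokenizer was used.

```python
from itertools import combinations
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_rouge_l(texts: List[str]) -> float:
    """Average ROUGE-L over all unordered pairs; lower means more diverse."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("need at least two texts")
    return sum(rouge_l_f1(a, b) for a, b in pairs) / len(pairs)
```

A set of personas that reuse the same phrasing yields a high mean pairwise score, while personas with little shared wording push the score toward zero, which is why lower ROUGE-L is read as higher diversity here.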

Main Takeaways

ECPO consistently outperforms SFT and standard preference optimization methods (DPO, KTO) across multiple domains.
The generated preference pairs (original vs. rewritten) are effective training data for DPO, which supports the 'Backward Expectation Derivation' approach.
AILO produces more diverse and human-like user simulations than prior state-of-the-art simulators (RecAgent, iEvalLM), providing a solid foundation for the EC process.

📚 Prerequisite Knowledge

Prerequisites

Direct Preference Optimization (DPO)
Conversational Recommendation Systems (CRS)
Expectation Confirmation Theory (ECT)
Instruction Tuning

Key Terms

ECPO: Expectation Confirmation Preference Optimization—the proposed method that aligns agents using satisfaction scores derived from theory rather than sampling

AILO: Activities, Interests, Language, and Orientations—the proposed user simulator framework designed to create diverse personas

ECT: Expectation Confirmation Theory—a framework stating satisfaction comes from comparing expectations to actual performance

DPO: Direct Preference Optimization—an algorithm for finetuning LLMs on preference pairs without an explicit reward model

SFT: Supervised Fine-Tuning—initial training phase on demonstration data

ROUGE-L: A metric used here to measure the similarity between generated user personas; lower scores indicate higher diversity

Chain of Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer