Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

📝 Paper Summary

Conversational personalization, Multi-turn Reinforcement Learning

CURIO enhances conversational personalization by incorporating a curiosity-based intrinsic reward into multi-turn RLHF, incentivizing agents to actively infer latent user traits during dialogue without requiring pre-existing profiles.

Core Problem

Standard RLHF optimizes for average user preferences or relies on extensive pre-collected user history, failing to adapt to new users with unknown traits during live interactions.

Why it matters:

One-size-fits-all models fail in high-stakes domains like education and healthcare where individual traits (e.g., learning style, emotional state) determine success
Real-world deployments often lack rich prior user data (cold start), rendering history-dependent personalization methods ineffective
Current methods neglect long-term personalization by optimizing single-turn rewards, failing to strategically gather information over a conversation

Concrete Example: A therapeutic chatbot trained on average user data might offer generic advice that fails to build rapport with a specific user, because it never learned to ask questions about the user's emotional history to tailor its approach.

Key Novelty

Curiosity-driven User-modeling Reward as an Intrinsic Objective (CURIO)

Treat the user as a hidden environment state to be explored; the agent receives an intrinsic reward for actions that reduce uncertainty about the user's type
Employ a separate 'User Model' that predicts user traits based on conversation history; the Policy is rewarded when it improves this User Model's prediction accuracy or reduces its entropy
Integrate this intrinsic reward into multi-turn RLHF, balancing the exploitation of helpfulness rewards with the exploration of user attributes

Architecture

The training framework involves four distinct models: Policy, Environment, Reward, and User Model.

Breakthrough Assessment

7/10

Novel application of intrinsic motivation and POMDP theory to LLM personalization, addressing the cold-start problem without pre-computed profiles. Theoretical grounding in reward shaping is strong.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) where user type 'u' is unobservable

Inputs: Conversation history s_t including previous utterances and actions

Outputs: Next agent response (action) a_t

Pipeline Flow

Interaction: Policy Model and Environment Model (User Simulator) generate the dialogue
Evaluation: User Model predicts the user type → Intrinsic Reward
Evaluation: Reward Model evaluates the full conversation → Extrinsic Reward
Update: Policy Model is updated via RL using the combined rewards
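
The flow above can be sketched as a toy rollout. This is a minimal illustration, not the paper's implementation: all four models are rule-based stand-ins rather than LLMs, the user types, utterances, and the weight `lam` are hypothetical, and the sketch stops before the RL update, returning only the per-turn rewards that such an update would consume.

```python
from typing import List

USER_TYPES = ["hands_on", "lecture"]  # hypothetical user types


def policy(history: List[str]) -> str:
    """Stand-in Policy Model: asks one question, then teaches."""
    return "ASK: how do you like to learn?" if not history else "TEACH: tailored material"


def environment(action: str, user_type: str) -> str:
    """Stand-in Environment Model: a simulated user with a hidden type."""
    if action.startswith("ASK"):
        return f"I prefer {user_type} learning"
    return "ok"


def user_model(history: List[str]) -> List[float]:
    """Stand-in User Model: belief over the two USER_TYPES given the dialogue."""
    for i, name in enumerate(USER_TYPES):
        if any(name in utterance for utterance in history):
            return [0.9, 0.1] if i == 0 else [0.1, 0.9]
    return [0.5, 0.5]  # uniform belief before any evidence


def reward_model(history: List[str]) -> float:
    """Stand-in Reward Model: scores the finished conversation."""
    return 1.0 if any(u.startswith("TEACH") for u in history) else 0.0


def rollout(user_type: str, num_turns: int = 2, lam: float = 1.0) -> List[float]:
    """Run one dialogue and return the combined reward for each turn."""
    true_idx = USER_TYPES.index(user_type)
    history: List[str] = []
    rewards: List[float] = []
    belief = user_model(history)
    for t in range(num_turns):
        # Interaction: policy acts, simulated user replies.
        action = policy(history)
        reply = environment(action, user_type)
        history += [action, reply]
        # Evaluation: intrinsic reward from the change in belief on the true type.
        new_belief = user_model(history)
        r_int = new_belief[true_idx] - belief[true_idx]
        # Evaluation: extrinsic reward only at the end of the conversation.
        r_ext = reward_model(history) if t == num_turns - 1 else 0.0
        rewards.append(r_ext + lam * r_int)  # lam is an assumed weighting knob
        belief = new_belief
    return rewards


if __name__ == "__main__":
    print(rollout("lecture"))
```

In this toy setting the first turn earns only intrinsic reward (the question reveals the user type), while the final turn earns the end-of-conversation extrinsic reward.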

System Modules

Policy Model (Interaction)

Generates conversational responses (actions) to maximize total reward

Model or implementation: LLM (architecture not specified in text)

Environment Model (Interaction)

Simulates a human user with a specific (hidden) user type

Model or implementation: LLM (fixed simulator)

User Model (Evaluation)

Predicts the probability distribution over user types (belief) based on context

Model or implementation: Parameterized model (trained or prompted LLM)

Reward Model (Evaluation)

Evaluates overall conversation quality (helpfulness/safety)

Model or implementation: Standard RLHF reward model

Novel Architectural Elements

Inclusion of an auxiliary 'User Model' specifically to compute curiosity rewards based on belief updates
Decoupled deployment of the User Model (accessed via remote API) to handle computational complexity during multi-turn RL training

Modeling

Base Model: Not reported in the provided text

Training Method: Multi-turn Reinforcement Learning (Online)

Objective Functions:

Purpose: Maximize expected cumulative reward including both extrinsic and intrinsic signals.

Formally: V^π(s_0) = E[Σ_t γ^t (R_ext + R_int)]

Purpose: Calculate intrinsic reward based on belief improvement (Accuracy).

Formally: R_int = b_{t+1}(u*) - b_t(u*)

Purpose: Calculate intrinsic reward based on belief improvement (Log Probability).

Formally: R_int = log(b_{t+1}(u*)) - log(b_t(u*))

Purpose: Calculate intrinsic reward based on uncertainty reduction (Entropy/Info Gain).

Formally: R_int = H(b_t) - H(b_{t+1})
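
The three intrinsic reward variants can be written directly from their formulas. The sketch below assumes beliefs are plain probability vectors over a finite set of user types and that `true_type` is the index of u*; the function names and the `eps` guard against log(0) are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence


def accuracy_reward(b_t: Sequence[float], b_next: Sequence[float], true_type: int) -> float:
    """R_int = b_{t+1}(u*) - b_t(u*)."""
    return b_next[true_type] - b_t[true_type]


def log_prob_reward(b_t: Sequence[float], b_next: Sequence[float],
                    true_type: int, eps: float = 1e-12) -> float:
    """R_int = log b_{t+1}(u*) - log b_t(u*); eps avoids log(0)."""
    return math.log(b_next[true_type] + eps) - math.log(b_t[true_type] + eps)


def entropy(b: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in b if p > 0.0)


def info_gain_reward(b_t: Sequence[float], b_next: Sequence[float]) -> float:
    """R_int = H(b_t) - H(b_{t+1}); needs no knowledge of the true type."""
    return entropy(b_t) - entropy(b_next)


if __name__ == "__main__":
    b_t, b_next = [0.5, 0.5], [0.8, 0.2]  # belief before and after one turn
    print(accuracy_reward(b_t, b_next, true_type=0))
    print(log_prob_reward(b_t, b_next, true_type=0))
    print(info_gain_reward(b_t, b_next))
```

Note that the first two variants require the true user type u* (available from the simulator during training), whereas the entropy variant rewards any sharpening of the belief, whether or not it moves toward the correct type.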

Compute: Not reported in the provided text

Comparison to Prior Work

vs. Poddar/Chen et al.: CURIO performs online personalization (learning during the chat) without requiring pre-collected user data or profiles
vs. Hong et al.: CURIO focuses on open-ended preference learning via online RL exploration rather than offline optimization of fixed goals
vs. VIME: Adapts intrinsic motivation from standard RL control tasks to the domain of LLM dialogue personalization

Limitations

User Model quality dependence: The intrinsic reward relies entirely on the accuracy and calibration of the auxiliary User Model
Computational cost: Requires maintaining/querying multiple LLMs (Policy, Environment, User Model, Reward Model) simultaneously
Sparse extrinsic rewards: Still relies on end-of-conversation rewards for the base task, which can be difficult to optimize

Reproducibility

The provided paper text mentions no replication artifacts, and code availability is not indicated.

📊 Experiments & Results

Evaluation Setup

Two conversational domains: Education Dialogue and Exercise Recommendation

Benchmarks:

Education Dialogue (Personalized teaching/tutoring)
Exercise Recommendation (Conversational recommendation)

Metrics:

Personalization performance (User Model Accuracy)
Conversation Quality
Generalization to unseen users

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

CURIO adapts to individual users more rapidly than the baselines.
CURIO motivates the LLM to actively reduce uncertainty about users by asking insightful questions.
The approach shows improved generalization capabilities to entirely unseen users compared to traditional multi-turn RLHF.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Partially Observable Markov Decision Processes (POMDPs)
Bayesian belief updates
Intrinsic Motivation / Curiosity-driven learning

Key Terms

CURIO: Curiosity-driven User-modeling Reward as an Intrinsic Objective—the proposed framework using intrinsic rewards for user learning

POMDP: Partially Observable Markov Decision Process—a framework where the agent makes decisions based on incomplete knowledge of the state (here, the unknown user type)

Intrinsic Reward: A reward signal generated internally by the agent (e.g., for learning or exploration) rather than from the external environment

PBRS: Potential-based Reward Shaping—a method of adding auxiliary rewards to accelerate learning without altering the optimal policy

Belief State: The agent's probability distribution over possible user types, updated as the conversation progresses

User Model: An auxiliary model (trained or prompted) that predicts the probability distribution of user types based on the dialogue context

GAE: Generalized Advantage Estimation—a technique used in RL to estimate the advantage of an action by balancing bias and variance

Information Gain: The reduction in entropy (uncertainty) regarding the user type after observing a new response
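
The PBRS and Belief State entries connect through a short derivation. As an illustration, assume the potential is the belief placed on the true user type, Φ(s_t) = b_t(u*), and take γ = 1; both choices are assumptions made here for clarity, since this summary does not state the potential or discount the paper uses.

```latex
\begin{align*}
F(s_t, a_t, s_{t+1}) &= \gamma\,\Phi(s_{t+1}) - \Phi(s_t),
  \qquad \Phi(s_t) = b_t(u^*) \\
\sum_{t=0}^{T-1} \bigl( b_{t+1}(u^*) - b_t(u^*) \bigr)
  &= b_T(u^*) - b_0(u^*) \qquad (\gamma = 1)
\end{align*}
```

Under these assumptions the accuracy-based reward R_int = b_{t+1}(u*) - b_t(u*) has the potential-based form, and its sum over a conversation telescopes: the total intrinsic return depends only on the initial and final beliefs, so the agent gains nothing by making the belief oscillate along the way.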