CURIO: Curiosity-driven User-modeling Reward as an Intrinsic Objective; the proposed framework, which gives the agent an intrinsic reward for learning about the user
POMDP: Partially Observable Markov Decision Process—a framework where the agent makes decisions based on incomplete knowledge of the state (here, the unknown user type)
Intrinsic Reward: A reward signal generated internally by the agent (e.g., for learning or exploration) rather than from the external environment
PBRS: Potential-based Reward Shaping—a method of adding auxiliary rewards to accelerate learning without altering the optimal policy
Belief State: The agent's probability distribution over possible user types, updated as the conversation progresses
User Model: An auxiliary model (trained or prompted) that predicts the probability distribution of user types based on the dialogue context
GAE: Generalized Advantage Estimation—a technique used in RL to estimate the advantage of an action by balancing bias and variance
Information Gain: The reduction in entropy (uncertainty) regarding the user type after observing a new response
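
To make the Belief State and Information Gain entries concrete, here is a minimal sketch in Python. It assumes the belief is a plain list of probabilities over user types, as a user model might output before and after a user response. The function names and the two-type example are hypothetical illustrations, not the framework's actual implementation.

```python
import math

def entropy(belief):
    """Shannon entropy (in nats) of a discrete belief over user types."""
    return -sum(p * math.log(p) for p in belief if p > 0)

def information_gain(belief_before, belief_after):
    """Reduction in entropy between two consecutive beliefs.

    Positive when the new response made the agent more certain about
    the user type; it can be negative if a response increases uncertainty.
    """
    return entropy(belief_before) - entropy(belief_after)

# Hypothetical example with two user types.
belief_before = [0.5, 0.5]  # maximally uncertain before the response
belief_after = [0.9, 0.1]   # user-model output after the response (assumed)

gain = information_gain(belief_before, belief_after)
print(f"information gain: {gain:.3f} nats")
```

A quantity like `gain` could serve as an intrinsic reward at each turn; whether it is used directly or through a potential-based shaping term is a design choice this sketch does not settle.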