Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation

📝 Paper Summary

Conversational personalization, User-profile based personalization

TOPDIAL is a dataset curation framework that uses role-playing LLM agents (User, System, Moderator) to synthesize large-scale personalized target-oriented dialogues where systems proactively lead conversations while adapting to user personalities.

Core Problem

Existing target-oriented dialogue datasets either lack personalization (ignoring user profiles/personalities) or are not proactive, and manual creation of high-quality personalized datasets is prohibitively expensive.

Why it matters:

Without personalization, proactive systems (like recommenders) can seem obtrusive or irrelevant, damaging user experience.
Existing datasets are often crowd-sourced without specific target goals or simply re-purposed from non-target data, lacking the specific dynamics of 'leading' a conversation.
Training models to be both proactive (reaching a goal) and personalized (respecting user style) requires data that exemplifies both simultaneously.

Concrete Example: In a movie recommendation scenario, a standard system might bluntly recommend 'King of Comedy'. A personalized system, knowing the user is 'shy' and likes 'Stephen Chow', would gently bridge the topic via the actor rather than the genre. Current datasets lack these nuanced, personality-driven transitions.

Key Novelty

LLM-based Role-Playing Data Curation Framework

Deploys three interacting LLM agents: a User agent (simulating specific profiles/Big-5 traits), a System agent (optimizing for target achievement), and a Moderator agent (managing termination).
Formulates the conversation target as a <dialogue act, topic> pair (e.g., <recommend, 'The Matrix'>) rather than just keywords, ensuring actionable goals.
Injects explicit personality traits (Big-5) into the User agent's prompt to generate diverse, human-like resistance or acceptance behaviors.

Architecture

The role-playing framework for automatic dataset curation involving three agents.

Evaluation Highlights

Alpaca-7B trained on TOPDIAL achieves an 85.04% target success rate, a 36.26-point improvement over the same model trained on the seed dataset (DuRecDial 2.0), which reaches 48.78%.
Personalization F1 score improves by +14.94 points (51.99 vs 37.05) for Alpaca-7B when trained on TOPDIAL compared to the seed dataset.
Curated ~18K multi-turn dialogues across 4 domains (Movies, Music, Food, POIs) with an average of 12.3 utterances per dialogue.

Breakthrough Assessment

7/10

Significant contribution to data synthesis for a niche but important problem (personalized proactive dialogue). The role-playing framework is well-executed, though the core innovation is the application of LLMs to data curation rather than a new model architecture.

⚙️ Technical Details

Problem Definition

Setting: Personalized Target-oriented Dialogue

Inputs: Target T (<dialogue act, topic>), User Information U (profiles, personalities), Domain Knowledge K, Dialogue History C

Outputs: System utterance that proactively leads towards T while respecting U
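
The inputs above can be pictured as a small set of typed records. The sketch below is only an illustration: the class names, field names, and the triple encoding of the domain knowledge K are assumptions made here, not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Target:
    """Conversation target T as a <dialogue act, topic> pair."""
    dialogue_act: str
    topic: str


@dataclass
class UserInfo:
    """User information U: profile attributes plus personality descriptions."""
    profile: dict[str, str]
    personality: dict[str, str]


@dataclass
class DialogueContext:
    """Everything the system conditions on when producing its next utterance."""
    target: Target
    user: UserInfo
    # (subject, relation, object) triples; an assumed encoding of K
    knowledge: list[tuple[str, str, str]]
    history: list[str] = field(default_factory=list)


# Illustrative values only; they are not taken from the dataset.
context = DialogueContext(
    target=Target(dialogue_act="recommend", topic="King of Comedy"),
    user=UserInfo(
        profile={"favorite_actor": "Stephen Chow"},
        personality={"extraversion": "low (shy)"},
    ),
    knowledge=[("King of Comedy", "stars", "Stephen Chow")],
)
```

Keeping the target as a frozen pair, rather than a bare keyword, makes it possible to check both the act and the topic when deciding whether a dialogue reached its goal.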

Pipeline Flow

Initialize User Agent (Profile + Big-5)
Initialize System Agent (Target + Knowledge + User Profile)
Environment Context Injection
Turn-by-turn Interaction Loop
Moderator Agent Check (Termination)
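
The flow above can be sketched as a simple loop. This is a minimal illustration with scripted stand-ins for the three agents: the real framework calls an LLM for each role, and the function names, the scripted replies, and the reading of one "turn" as a system utterance followed by a user reply are all assumptions made here.

```python
from typing import Callable

Agent = Callable[[list[str]], str]
Moderator = Callable[[list[str]], bool]


def run_episode(system: Agent, user: Agent, moderator: Moderator,
                max_turns: int = 8) -> list[str]:
    """Alternate system and user utterances until the moderator stops the dialogue."""
    history: list[str] = []
    for _ in range(max_turns):
        history.append("System: " + system(history))
        history.append("User: " + user(history))
        if moderator(history):  # termination check after every exchange
            break
    return history


# Scripted stand-ins (hypothetical); each would be an LLM call in practice.
def scripted_system(history: list[str]) -> str:
    steps = [
        "Hi! How is your day going?",
        "Do you enjoy Stephen Chow's films?",
        "You might like 'King of Comedy'.",
    ]
    return steps[min(len(history) // 2, len(steps) - 1)]


def scripted_user(history: list[str]) -> str:
    if "King of Comedy" in history[-1]:
        return "Sure, I will give it a try."
    return "Maybe, tell me more."


def keyword_moderator(history: list[str]) -> bool:
    return "give it a try" in history[-1]


dialogue = run_episode(scripted_system, scripted_user, keyword_moderator)
```

With these stand-ins the loop ends after the third exchange, once the user accepts the recommendation; a rejecting user would instead run until `max_turns` is exhausted.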

System Modules

User Agent (Data Curation)

Simulate a human user with specific preferences and personality

Model or implementation: ChatGPT (gpt-3.5-turbo)

System Agent (Data Curation)

Proactively lead conversation to target while maintaining engagement

Model or implementation: ChatGPT (gpt-3.5-turbo)

Moderator Agent (Data Curation)

Determine if the conversation should end based on success or rejection

Model or implementation: ChatGPT (gpt-3.5-turbo)

Novel Architectural Elements

Environment-specific role-playing context injection
Moderator agent with explicit logic for target failure/success detection
Explicit injection of Big-5 personality traits into User Agent prompts
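
As a rough illustration of the personality injection listed above, a User Agent prompt could be assembled from a profile and five trait descriptions. The wording below is hypothetical and is not the paper's template (those are given in its appendix); the function name and dictionary keys are also assumptions.

```python
BIG5 = ("openness", "conscientiousness", "extraversion",
        "agreeableness", "neuroticism")


def build_user_prompt(profile: dict[str, str], traits: dict[str, str]) -> str:
    """Compose a role-play prompt that states the profile and all Big-5 traits."""
    missing = [t for t in BIG5 if t not in traits]
    if missing:
        raise ValueError(f"missing Big-5 traits: {missing}")
    profile_lines = "\n".join(f"- {key}: {value}" for key, value in profile.items())
    trait_lines = "\n".join(f"- {t}: {traits[t]}" for t in BIG5)
    return (
        "You are role-playing a user in a conversation.\n"
        f"Your profile:\n{profile_lines}\n"
        f"Your personality (Big-5):\n{trait_lines}\n"
        "Stay in character and reply in one or two sentences."
    )


prompt = build_user_prompt(
    profile={"favorite_actor": "Stephen Chow"},
    traits={
        "openness": "high",
        "conscientiousness": "medium",
        "extraversion": "low",
        "agreeableness": "high",
        "neuroticism": "medium",
    },
)
```

Requiring all five traits up front avoids silently generating users whose personality is only partly specified.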

Modeling

Base Model: ChatGPT (gpt-3.5-turbo) for curation; DialoGPT-small and Alpaca-7B for validation

Training Method: Supervised Fine-Tuning (for validation baselines)

Adaptation: LoRA (for Alpaca-7B); Full fine-tuning (for DialoGPT)
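
For readers unfamiliar with LoRA, the standard formulation is sketched below: the pre-trained weight matrix $W_0$ stays frozen and only the two low-rank factors are trained. The rank and the layers adapted for Alpaca-7B are not stated in this summary, so $r$ is left symbolic.

```latex
h = W_0 x + \Delta W x = W_0 x + B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```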

Training Data:

TOPDIAL Dataset: 12,601 train / 1,802 valid / 3,606 test dialogues
Validation baselines are trained on equal-sized subsets (5K dialogues each) of the Seed and TOPDIAL data

Key Hyperparameters:

max_decoding_length: 80
epochs: 5
temperature: 0.75 (for curation agents)
max_turns: 8 (curation limit)

Compute: Curation cost: ~0.032 USD per dialogue via OpenAI API. Fine-tuning Alpaca-7B on 2 NVIDIA 3090 GPUs.
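
A back-of-the-envelope estimate of the total curation cost follows from the split sizes and the per-dialogue price reported above. It assumes the quoted price applies uniformly to every released dialogue and ignores any generations that were discarded, so it is a rough lower bound rather than a reported figure.

```python
# Split sizes and per-dialogue cost as reported in this summary.
splits = {"train": 12_601, "valid": 1_802, "test": 3_606}
cost_per_dialogue_usd = 0.032

total_dialogues = sum(splits.values())
estimated_cost_usd = total_dialogues * cost_per_dialogue_usd
fractions = {name: count / total_dialogues for name, count in splits.items()}

print(total_dialogues, round(estimated_cost_usd, 2))  # 18009 576.29
```

The split fractions come out close to 70% / 10% / 20% for train / valid / test.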

Comparison to Prior Work

vs. DuRecDial 2.0 (Seed): TOPDIAL is synthetic, ensuring every dialogue is explicitly target-driven and consistent with personality constraints
vs. OTTers: TOPDIAL is multi-turn and incorporates personalization (Big-5 traits)
vs. TG-ReDial: TOPDIAL targets specific pairs and includes rich user profiles, whereas TG-ReDial focuses on topic paths

Reproducibility

Code: https://github.com/iwangjian/TopDial

Code and dataset are publicly released. The implementation uses the ChatArena library, and the specific prompt templates are provided in the paper's appendix.

📊 Experiments & Results

Evaluation Setup

Train dialogue models (DialoGPT, Alpaca) on Seed vs. TOPDIAL datasets, evaluate on a mixed test set (50% Seed / 50% TOPDIAL).

Benchmarks:

TOPDIAL Test Set (Personalized Target-oriented Dialogue Generation) [New]
DuRecDial 2.0 (Re-purposed) Test Set (Conversational Recommendation)

Metrics:

Target Success Rate (Succ. %)
Persona F1 (%)
Knowledge F1 (%)
BLEU-1/2
Statistical methodology: Fleiss's kappa reported for human evaluation agreement.
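
Persona F1 is a unigram-overlap F1 between the generated response and the user's profile text. A minimal sketch of such a score is shown below; lowercasing and whitespace tokenization are assumptions made here, and the paper's exact preprocessing may differ.

```python
from collections import Counter


def unigram_f1(response: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall, using clipped counts."""
    response_tokens = response.lower().split()
    reference_tokens = reference.lower().split()
    if not response_tokens or not reference_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(response_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(response_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("i like stephen chow", "stephen chow")
```

In this toy case precision is 0.5 and recall is 1.0, so the score is 2/3. An entity-level variant, as described for Knowledge F1, would compare sets of matched entities instead of raw tokens.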

Key Results

Comparative analysis of baseline models (Alpaca-7B and DialoGPT) trained on the seed dataset versus the proposed TOPDIAL dataset shows significant improvements in personalization and target success.
Benchmark	Metric	Baseline (Seed-trained)	This Paper (TOPDIAL-trained)	Δ (points)
Mixed Test Set	Succ. (%)	48.78	85.04	+36.26
Mixed Test Set	Persona F1 (%)	37.05	51.99	+14.94
Mixed Test Set	Knowledge F1 (%)	38.60	57.12	+18.52
Mixed Test Set	Succ. (%)	32.94	51.83	+18.89
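
The Δ column is an absolute difference in percentage points, not a relative gain. The short check below recomputes it from the reported values; the tuple layout is just a convenience chosen here.

```python
# (metric, seed-trained score, TOPDIAL-trained score) as listed in the table.
rows = [
    ("Succ. (%)", 48.78, 85.04),
    ("Persona F1 (%)", 37.05, 51.99),
    ("Knowledge F1 (%)", 38.60, 57.12),
    ("Succ. (%)", 32.94, 51.83),
]

# Absolute improvement in percentage points, rounded to two decimals.
deltas = [round(topdial - seed, 2) for _, seed, topdial in rows]
```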

Experiment Figures

Transitions of dialogue acts of the system through the first six rounds.

Win/Loss/Tie rates for TOPDIAL vs Seed dataset via ChatGPT and Human evaluation.
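
Agreement among human annotators on win/tie/loss judgments is reported with Fleiss's kappa. A minimal sketch of the standard statistic follows; the rating counts used in the example are invented for illustration and are not the paper's annotations.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss's kappa for a table of counts[item][category] rating tallies.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    category_share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    expected = sum(p * p for p in category_share)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (observed - expected) / (1 - expected)


# Columns: win, tie, loss. Three raters per item (made-up numbers).
perfect = fleiss_kappa([[3, 0, 0], [0, 3, 0], [0, 0, 3]])
mixed = fleiss_kappa([[2, 1, 0], [0, 3, 0], [1, 0, 2]])
```

Perfect agreement yields a kappa of 1, while the mixed example falls well below that, reflecting the split votes on two of its three items.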

Main Takeaways

Training on TOPDIAL significantly improves Target Success Rate and Persona F1 compared to training on re-purposed human datasets, validating the dataset's quality.
The generated dialogues successfully transition from greeting to target acts (recommendations) through intermediate steps like 'elicit interest' and 'introduce attribute' (visualized in analysis).
LLM-based automatic evaluation and human evaluation show TOPDIAL has comparable or slightly better coherence and proactivity than the human-generated seed dataset.

📚 Prerequisite Knowledge

Prerequisites

Target-oriented dialogue systems
Large Language Models (LLMs) for role-playing
Big-5 personality traits

Key Terms

Big-5 personality traits: A psychological model describing personality via five dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism.

Target-oriented dialogue: Conversational systems designed to proactively steer the discussion toward a predefined goal (e.g., recommending a specific item) rather than just chatting passively.

DuRecDial 2.0: A widely used conversational recommendation dataset, used here as a seed source for knowledge and profiles.

Knowledge F1: A metric measuring the overlap between the generated response and the ground-truth domain knowledge entities.

Persona F1: A metric measuring the uni-gram overlap between the generated response and the user's grounded profile information.

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices.

Seed dataset: The re-purposed DuRecDial 2.0 dataset used as the baseline and source of entities for generating TOPDIAL.