Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations

📝 Paper Summary

Personalized Dialogue Generation, Benchmark Datasets, Evaluation Methodologies

This survey systematizes the field of personalized dialogue generation by analyzing 22 datasets and classifying 17 recent methodologies into five core problem domains, ranging from consistency to data scarcity.

Core Problem

Personalized dialogue generation lacks a unified definition and standardized benchmarks; current research is fragmented across varying goals (agent persona vs. user modeling) and relies on datasets with significant domain and language biases.

Why it matters:

Personalization is critical for user engagement, yet current datasets are often limited in size, quality, and diversity
Methodologies vary widely in how they define 'persona' (explicit text vs. implicit history), making cross-comparison difficult without a structured taxonomy
Language bias is severe; most resources focus on English/Chinese, leaving other languages with insufficient training data for personalized agents

Concrete Example: While PersonaChat is the standard benchmark, it only contains English dialogues. Its multilingual variant, XPersona, is severely limited, containing only 280 dialogues for Italian, making it insufficient for robust training compared to the 10.9K dialogues in the English version.

Key Novelty

Comprehensive Taxonomy of Personalized Dialogue

Classifies 17 seminal works (2021-2023) into five distinct problem types: Consistency/Coherence, Persona-Context Balancing, Relevant Persona Selection, Unknown Persona Modeling, and Data Scarcity
Categorizes 22 datasets based on persona representation (descriptive sentences, key-value attributes, or user ID/history) and features (multi-session, multi-modal, grounding)
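
The three persona representations named above can be pictured as three small data types. The following is a minimal sketch for illustration only; the class names, field names, and the `persona_to_text` helper are hypothetical and do not come from the survey or from any dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class DescriptivePersona:
    """Persona given as free-text sentences (PersonaChat-style)."""
    sentences: List[str]


@dataclass
class AttributePersona:
    """Persona given as sparse key-value attributes."""
    attributes: Dict[str, str]


@dataclass
class HistoryPersona:
    """Persona implied by a user ID and that user's past utterances."""
    user_id: str
    past_utterances: List[str] = field(default_factory=list)


# Hypothetical union type covering the three representations.
Persona = Union[DescriptivePersona, AttributePersona, HistoryPersona]


def persona_to_text(persona: Persona) -> str:
    """Flatten any persona representation into one conditioning string."""
    if isinstance(persona, DescriptivePersona):
        return " ".join(persona.sentences)
    if isinstance(persona, AttributePersona):
        return "; ".join(f"{k}: {v}" for k, v in persona.attributes.items())
    if isinstance(persona, HistoryPersona):
        return " ".join(persona.past_utterances)
    raise TypeError(f"unsupported persona type: {type(persona).__name__}")
```

Flattening to a single string is only one possible design choice; methods that model implicit personas from user history would typically encode the history rather than concatenate it.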

Architecture

Conceptual taxonomy of the three personalization scenarios in dialogue generation research

Evaluation Highlights

Catalogued 22 distinct datasets and found PersonaChat (10.9K dialogues) to be the dominant benchmark, used in 9 of 18 reviewed works
Highlighted severe data scarcity in multilingual resources, specifically noting XPersona contains only 280 Italian dialogues compared to thousands in English
Analyzed 17 top-conference papers and found that recent work represents persona in three distinct ways: descriptive sentences (most common), key-value attributes, or raw user history

Breakthrough Assessment

7/10

A solid systematic review that organizes a fragmented field. While it doesn't propose a new model, its taxonomy of 5 problem types and catalog of 22 datasets provide a necessary roadmap for future research.

⚙️ Technical Details

Problem Definition

Setting: Conditional text generation where a response R is generated based on context C and conditioned on persona P (either agent persona P_A or user persona P_U).

Inputs: Dialogue context C (current query Q + history H) and Persona P (text descriptions, attributes, or history)

Outputs: Generated response R consistent with P and coherent with C
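
One common way to write this setting down, assumed here rather than quoted from the survey, is an autoregressive factorization of the response given context and persona. The symbols r_t (the t-th token of R) and T (the length of R) are introduced only for this illustration.

```latex
p(R \mid C, P) = \prod_{t=1}^{T} p\left(r_t \mid r_{<t}, C, P\right),
\qquad C = (Q, H), \quad P \in \{P_A, P_U\}
```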

Comparison to Prior Work

vs. BoB: This paper surveys BoB as a method addressing the 'Consistency and Coherence' problem type
vs. FoCus: This paper categorizes FoCus as a dataset contributing 'persona grounding' features
vs. UA-CVAE: This paper categorizes UA-CVAE under 'Consistency and Coherence' methodologies using latent variables

Limitations

Datasets are heavily biased towards English and Chinese; other languages like Italian have negligible data (e.g., 280 dialogues)
Real-world personas are more diverse than those covered by crowdsourced datasets, leading to poor out-of-distribution (OOD) performance
Evaluation metrics are not standardized; works report overlapping but inconsistent sets of metrics
Most datasets rely on static persona descriptions, failing to capture dynamic persona evolution over time

Reproducibility

Survey paper; reviews existing datasets. Mentions availability of specific datasets like PersonaChat (public), Japanese PersonaChat, and Korean PersonaChat (AI Hub). Code for the survey itself is not provided.

📊 Experiments & Results

Evaluation Setup

Systematic literature review of 17 papers from top conferences (ACL, NAACL, EMNLP, AAAI, 2021-2023) and 22 datasets.

Benchmarks:

PersonaChat (Persona-conditioned dialogue generation)
Multi-Session Chat (MSC) (Long-term memory dialogue generation)
XPersona (Multilingual persona dialogue)

Metrics:

BLEU
Perplexity (PPL)
Consistency Score (via NLI)
Persona Grounding Accuracy
Statistical methodology: Not explicitly reported in the paper
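
Of the metrics listed above, perplexity has a compact closed form: the exponential of the mean negative log-likelihood per token. The sketch below assumes per-token natural-log probabilities are already available from some language model; the function name `perplexity` and the toy input are illustrative, not taken from the survey.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities of a reference response."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input: a model that assigns probability 0.25 to each of four tokens.
uniform = [math.log(0.25)] * 4
ppl = perplexity(uniform)
```

With a uniform probability of 0.25 per token, the perplexity equals 4, which matches the intuition of choosing among four equally likely options at each step.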

Key Results

Quantitative analysis of the dataset landscape reveals significant reliance on a single benchmark and severe data scarcity in multilingual settings.

Item	Metric	Value	Reference value	Δ
PersonaChat	Number of dialogues	10,900	n/a	n/a
XPersona (Italian)	Number of dialogues	280	10,900 (English PersonaChat)	-10,620
Literature review	Works using PersonaChat	9	18 (reviewed works)	-9
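
The proportions behind the rows above can be checked with a few lines of arithmetic. This sketch only re-uses the figures quoted in this summary; the variable names are illustrative.

```python
# Figures as quoted in the rows above.
personachat_dialogues = 10_900
xpersona_italian_dialogues = 280
works_using_personachat = 9
works_reviewed = 18

# Italian XPersona size relative to English PersonaChat.
italian_share = xpersona_italian_dialogues / personachat_dialogues
# Fraction of reviewed works that evaluate on PersonaChat.
personachat_usage = works_using_personachat / works_reviewed

print(f"Italian share of English size: {italian_share:.1%}")
print(f"Works using PersonaChat: {personachat_usage:.1%}")
```

In other words, the Italian split is under 3% of the size of the English benchmark, while half of the counted works evaluate on PersonaChat.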

Main Takeaways

Methodologies have shifted from simply endowing agents with static personas to actively modeling user personas (explicitly or implicitly) to improve engagement
Data scarcity is a major bottleneck, addressed either by augmentation (e.g., the D3 method) or by using LLMs for few-shot generation, though language bias remains a critical issue
Persona representation has diversified into three categories: descriptive sentences (most common), sparse key-value attributes, and raw user interaction history
Consistency is increasingly enforced via auxiliary NLI (Natural Language Inference) models rather than just relying on language modeling loss
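
The NLI-based consistency idea in the last takeaway can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring scheme (+1 for entailment, 0 for neutral, -1 for contradiction, summed over persona sentences) is one simple choice made here, and `toy_nli` is a hypothetical keyword-based stand-in for a trained NLI classifier, not a real model or any reviewed paper's method.

```python
from typing import Callable, Iterable

# Hypothetical interface: an NLI model maps (premise, hypothesis) to one label,
# which is "entailment", "neutral", or "contradiction".
NLIFn = Callable[[str, str], str]

LABEL_SCORE = {"entailment": 1, "neutral": 0, "contradiction": -1}
STOPWORDS = {"i", "a", "the", "have", "do", "not"}


def consistency_score(response: str, persona_sentences: Iterable[str], nli: NLIFn) -> int:
    """Sum NLI label scores of the response against each persona sentence."""
    return sum(LABEL_SCORE[nli(p, response)] for p in persona_sentences)


def toy_nli(premise: str, hypothesis: str) -> str:
    """Keyword-based stand-in for a trained NLI classifier (demonstration only)."""
    p_words = set(premise.lower().replace(".", "").split())
    h_words = set(hypothesis.lower().replace(".", "").split())
    shared = (p_words & h_words) - STOPWORDS
    if not shared:
        return "neutral"
    # Treat a mismatch in negation as a contradiction.
    negated = ("not" in h_words) != ("not" in p_words)
    return "contradiction" if negated else "entailment"


persona = ["I have a dog.", "I like tea."]
contradicting = consistency_score("I do not have a dog.", persona, toy_nli)
supporting = consistency_score("My dog is cute.", persona, toy_nli)
```

In practice the `nli` argument would be a trained classifier; keeping it as a plain callable lets the same scoring function serve as an evaluation metric or as a training-time signal.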

📚 Prerequisite Knowledge

Prerequisites

Conversational Agents / Dialogue Systems
Natural Language Inference (NLI) for consistency
Variational Autoencoders (VAE) for latent variable modeling
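
For readers new to latent variable modeling, the standard conditional VAE training objective is an evidence lower bound on the response likelihood. The form below is the textbook CVAE bound, written with this summary's R and C; z stands for a latent (implicit persona) variable, and the parameter symbols θ and φ are introduced here for illustration. It is not necessarily the exact objective of any specific reviewed method.

```latex
\log p_\theta(R \mid C) \;\ge\;
\mathbb{E}_{q_\phi(z \mid R, C)}\left[\log p_\theta(R \mid z, C)\right]
- \mathrm{KL}\left(q_\phi(z \mid R, C) \,\|\, p_\theta(z \mid C)\right)
```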

Key Terms

Persona Grounding: Labels indicating which specific persona sentence or attribute a dialogue utterance is based on, allowing models to learn explicit associations

PersonaChat: A benchmark dataset of 10.9K English dialogues where paired crowdworkers chat while adopting specific persona descriptions

CVAE: Conditional Variational Autoencoder—a generative model used here to infer implicit persona information (latent variables) from dialogue history when explicit profiles are missing

NLI: Natural Language Inference—a classification task determining if one sentence entails, contradicts, or is neutral to another; used here to check if a generated response contradicts the agent's persona

Multi-Session Chat (MSC): A dataset extension where the same speakers converse over multiple sessions, requiring the agent to recall information from previous interactions

Zero-shot / Few-shot: Evaluating a model's ability to perform a task (here, personalized generation) with no or very few task-specific training examples

Out-of-Distribution (OOD): Scenarios where the model encounters personas or topics during testing that were not present in the training data

BoB: BERT-over-BERT, a model architecture reviewed in the survey that uses NLI to ensure consistency between response and persona