Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM

📝 Paper Summary

Conversational personalization User modeling

Augmenting dialog training data with LLM-generated internal thoughts (Think-Aloud Utterances) before each response helps models better mimic specific human personality traits like Agreeableness and Neuroticism.

Core Problem

Modeling non-celebrity personalities is difficult because detailed profiles are scarce, and training on raw dialog alone often fails to capture internal psychological states driving behavior.

Why it matters:

Replicating non-celebrity personas is challenging due to the lack of public biographies or detailed profiles available for famous figures.
Surface-level utterances in chat logs often lack explicit information about the speaker's internal personality traits and emotional state.
Accurate personality modeling is crucial for consistent user interactions in entertainment and personalized agents.

Concrete Example: A speaker with high Neuroticism might say 'I'm in college now,' which seems neutral. However, their internal thought might be 'I feel anxious about the future.' Without the internal thought (TAU), the model misses the anxiety trait underlying the neutral statement.

Key Novelty

Think-Aloud Utterance (TAU) Augmentation

Synthetically insert a 'Think-Aloud Utterance' (TAU) before every target speaker's turn in a training dialog, verbalizing their hidden thoughts and feelings using a powerful LLM.
Fine-tune the persona model to generate both the internal thought and the final utterance, allowing it to learn the psychological process behind the speech.

Architecture

Conceptual diagram of the TAU augmentation process

Evaluation Highlights

Reduces MSE for Agreeableness and Neuroticism consistently across all tested base models (e.g., gpt-4o-mini MSE drops from 1.662 to 1.571 for Neuroticism) compared to standard dialog training.
Higher quality TAUs (generated by GPT-4o vs Qwen) lead to better personality alignment, further reducing MSE for Agreeableness by ~0.09 and Neuroticism by ~0.24 in gpt-4o-mini experiments.
Including explicit Big Five scores in the augmentation prompt further improves alignment for Extraversion, Agreeableness, and Neuroticism.

Breakthrough Assessment

6/10

A simple but effective data augmentation technique for personality modeling. Shows promise for internal state modeling, though gains are inconsistent across all Big Five traits and base models.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning LLMs to mimic specific target speakers using dialog history

Inputs: Dialog history up to turn t-1

Outputs: Target speaker's internal thought (TAU) followed by their surface utterance at turn t

Pipeline Flow

Data Augmentation: Use teacher LLM to insert TAUs into raw dialogs
Fine-tuning: Train student LLM on augmented dialogs (History → TAU + Utterance)

System Modules

TAU Augmenter

Generate internal thoughts for the target speaker based on context

Model or implementation: Qwen2.5-72B-Instruct or gpt-4o

Persona Model

Predict next turn (thought + speech) mimicking target personality

Model or implementation: Target LLM (e.g., gemma-2-9b-it) fine-tuned with QLoRA

Novel Architectural Elements

Training the persona model to output a structured internal thought block (<thinking> tags) derived from a synthetic teacher, specifically for personality alignment

Modeling

Base Model: Analyzed four base models: gpt-4o-mini, Llama-3-Swallow-8B-Instruct, Qwen2.5-7B-Instruct, gemma-2-9b-it

Training Method: Supervised Fine-Tuning (SFT) with QLoRA

Training Data:

RealPersonaChat (RPC) dataset: 3,929 dialogs from 20 target speakers
Split approx 8:1:1 for train/val/test per speaker
Augmented with TAUs using teacher models (Qwen2.5-72B or gpt-4o)

Key Hyperparameters:

lora_rank: 64
lora_alpha: 64
learning_rate: Selected per model based on validation similarity (BERTScore/ROUGE)

Compute: NVIDIA H100 SXM5 94GB (one GPU used)

Comparison to Prior Work

vs. Character-LLM: Focuses on non-celebrities where no external profile (Wikipedia) exists; infers thoughts solely from dialog context
vs. RoleLLM: Augments real human dialogs rather than relying on scraped or purely synthetic role-play data
vs. Standard Fine-Tuning (w/ RPC): Explicitly models the latent 'thought' step before utterance, rather than just mapping history to utterance

Limitations

Effectiveness varies by base model; Llama-3-Swallow and Qwen2.5-7B showed limited gains from fine-tuning generally.
Results are inconsistent for Openness, Conscientiousness, and Extraversion traits.
Relies on the quality of the teacher LLM to infer 'true' thoughts; real ground-truth thoughts are unavailable.
Artificial chat setting of the dataset (strangers chatting) may constrain natural personality expression.

Reproducibility

Code availability not provided. Dataset is RealPersonaChat (RPC). Prompt templates for TAU augmentation and Big Five evaluation are provided in appendices. Teacher models used for augmentation are accessible via APIs.

📊 Experiments & Results

Evaluation Setup

Personality assessment using a 60-item Big Five questionnaire (Wada, 1996)

Benchmarks:

Big Five alignment on RealPersonaChat speakers (Personality Trait Estimation)

Metrics:

Mean Squared Error (MSE) between model-predicted Big Five scores and human ground-truth scores
Statistical methodology: Pearson correlation between base model MSE and fine-tuning gains reported with p-values

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Fine-tuning with TAU (w/ +TAU) generally improves alignment (lower MSE) for Neuroticism and Agreeableness compared to standard fine-tuning (w/ RPC), though results for other traits are mixed.
Big Five (Neuroticism)	MSE (lower is better)	1.662	1.571	-0.091
Big Five (Agreeableness)	MSE (lower is better)	0.891	0.749	-0.142
Big Five (Mean)	MSE (lower is better)	1.280	1.212	-0.068
Ablation on TAU quality shows that better teacher models (gpt-4o vs Qwen) and providing Big Five hints (gpt+BF) lead to better student performance.
Big Five (Mean)	MSE Gain (w/+TAU - w/RPC)	0.139	0.024	-0.115
Big Five (Mean)	MSE Gain (w/+TAU - w/RPC)	0.024	-0.041	-0.065

Main Takeaways

TAU augmentation consistently helps models learn Agreeableness and Neuroticism, likely because these traits are strongly reflected in internal monologues (e.g., anxiety, empathy).
Effectiveness is dependent on the base model; Gemma-2 benefited most, while Llama-3-Swallow and Qwen-7B showed resistance to fine-tuning improvements generally.
Quality of the synthetic TAUs matters: TAUs generated by GPT-4o or with explicit personality hints produce better downstream personality alignment than those from weaker models.
Fine-tuning is most effective for target speakers whose personalities are distant from the base model's default personality (negative correlation observed).

📚 Prerequisite Knowledge

Prerequisites

Big Five personality traits (Ocean model)
Instruction fine-tuning of LLMs (QLoRA)
Chain-of-thought or internal monologue prompting concepts

Key Terms

TAU: Think-Aloud Utterance—a verbalization of a speaker's internal psychological state (thoughts/emotions) before they speak

Big Five: A standard psychological framework describing personality via five traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism

QLoRA: Quantized Low-Rank Adaptation—a parameter-efficient fine-tuning method that reduces memory usage by freezing the base model and training small adapters

MSE: Mean Squared Error—used here to measure the difference between the model's predicted personality score and the human speaker's true score

RealPersonaChat (RPC): A Japanese chat corpus containing dialogs where participants have annotated demographic and personality information

TRL: Transformer Reinforcement Learning library—a library used here for supervised fine-tuning

BERTScore: A metric evaluating text generation quality by comparing contextual embeddings of candidate and reference texts