Eeyore: Realistic Depression Simulation via Supervised and Preference Optimization

📝 Paper Summary

User modeling (Clinical simulation) Safety alignment vs. Realism

Eeyore simulates realistic depression by aligning an 8B model to structured psychological profiles using a two-stage preference optimization process that generates negative training samples via profile noise injection.

Core Problem

General-purpose LLMs are optimized for safety and positivity, preventing them from authentically simulating the negative thought patterns, cognitive distortions, and self-harm ideation required for realistic clinical training.

Why it matters:

Novice counselors need realistic practice environments, but current LLM simulations are overly sanitized and fail to represent the severity of mental health conditions
Prompt engineering alone cannot overcome the inherent safety biases of models like GPT-4, leading to inauthentic 'perfect patient' interactions
Existing datasets lack the structured psychological metadata needed to control specific symptom manifestations (e.g., moderate vs. severe depression)

Concrete Example: When simulating a depressed client, a standard LLM might refuse to express hopelessness or self-harm ideation due to safety filters, whereas a real client with 'severe depression' would exhibit these traits. Eeyore uses a structured profile to force the model to adhere to these darker traits.

Key Novelty

Profile-Noise Augmented Preference Optimization

Uses a structured psychological profile (symptoms, severity, demographics) to guide model behavior via instruction tuning
Generates DPO (Direct Preference Optimization) negative samples by artificially injecting 'noise' into the profile (e.g., swapping 'severe' for 'mild') to force the model to distinguish between profile-compliant and slightly deviated responses
Integrates a second stage of expert-annotated preferences to calibrate the model against human clinical judgment

Architecture

The three-stage framework: (1) Language-Specific Alignment (Data Curation), (2) Profile-Guided Role-Playing (SFT), and (3) Iterative Preference Optimization (DPO).

Evaluation Highlights

96.0% of model-generated attributes comply with the assigned psychological profile according to a GPT-4o verifier
85.2% of extracted depression traits in the training data were verified as accurate by clinical experts
82.0% of expert annotations in the second stage indicated a clear preference for one response over another, facilitating effective preference learning

Breakthrough Assessment

8/10

Significant methodology for overcoming safety refusal in clinical simulation. The profile-noise DPO strategy is a clever solution to the 'model is too good to sample negatives' problem.

⚙️ Technical Details

Problem Definition

Setting: Role-play simulation conditional on a structured psychological profile

Inputs: System prompt containing psychological profile (demographics, symptoms, severity) + Conversation history

Outputs: Patient response y consistent with the profile

Pipeline Flow

Data Curation: Mine conversations -> Extract Profiles -> Rebalance
Phase 1: Instruction Tuning (SFT) on profile-dialogue pairs
Phase 2: DPO with Model-Generated Preferences (Profile Noise)
Phase 3: DPO with Expert-Annotated Preferences

System Modules

Profile Extractor

Extract structured depression traits from raw conversation text

Model or implementation: GPT-4o

Eeyore (SFT)

Learn to generate responses conditional on profiles

Model or implementation: 8B model (Instruction Tuned)

Preference Generator (Noise)

Create negative samples for DPO by perturbing the profile

Model or implementation: Eeyore (SFT version)

Novel Architectural Elements

Profile-Noise Augmented DPO: Generating negative samples (y_l) by prompting the model with a perturbed psychological profile (x_n) rather than the original input (x_o), forcing the model to learn fine-grained profile adherence

Modeling

Base Model: 8B model (exact architecture not specified in provided text, likely Llama-3-8B given context)

Training Method: Supervised Fine-Tuning (SFT) followed by two-stage Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to prefer profile-aligned responses over noisy-profile responses.

Formally: DPO loss L_DPO(pi_theta; pi_ref) = -E[log sigmoid(beta * log(pi_theta(yw|x)/pi_ref(yw|x)) - beta * log(pi_theta(yl|x)/pi_ref(yl|x)))]

Training Data:

3,042 high-quality conversations curated from RED, HOPE, ESC, AnnoMI datasets
1,933 model-generated preference pairs for Stage 1 DPO
250 expert-annotated preference pairs for Stage 2 DPO

Key Hyperparameters:

profile_noise_rate: 30% of attributes modified
probability_ratio_threshold_tau: 2

Compute: Not reported in the paper

Comparison to Prior Work

vs. Patient-psi: Eeyore uses direct preference optimization on profiles rather than just prompting cognitive models
vs. GPT-4o: Eeyore is fine-tuned on rebalanced, expert-curated data to overcome safety biases
vs. Standard DPO: Eeyore uses 'profile noise' to generate negatives because the SFT model is too good to generate bad responses naturally

Limitations

Relies on expert annotation which is costly and difficult to scale (Stage 2 DPO)
Depends on the quality of the initial profile extraction from noisy public data
Generated negative samples (y_n) might theoretically be outliers for the original distribution (mitigated by probability constraints)

Reproducibility

Code: https://anonymous.github.com

Data and code availability mentioned ('anonymous.github.com'). Detailed profile structure provided in text. Exact hyperparameters for SFT/DPO training (LR, batch size) not in provided text.

📊 Experiments & Results

Evaluation Setup

Interactive role-play simulation evaluated by GPT-4o verifier and human experts

Benchmarks:

Profile Adherence Verification (Automated Evaluation (LLM-as-a-judge)) [New]
Expert Preference Annotation (Human Evaluation) [New]

Metrics:

Profile Adherence Score (GPT-4o)
Expert Preference Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Validation of the training data and intermediate model performance.
Trait Extraction Verification	Accuracy	Not applicable	85.2%	Not applicable
Profile Adherence	Attribute Compliance %	Not applicable	96.0%	Not applicable
Profile Adherence	Perfect Match Rate	Not applicable	31.7%	Not applicable
Stage 2 DPO Data Collection	Clear Preference Rate	Not applicable	82.0%	Not applicable

Experiment Figures

The DPO sampling process using Profile Noise Augmentation.

Main Takeaways

The instruction-tuned model achieves high attribute compliance (96%) but often misses subtle details (only 31.7% perfect matches), necessitating DPO.
Standard preference sampling is ineffective because the model is too good; profile noise augmentation is required to generate useful negative samples.
Experts generally found the model outputs distinguishable and had clear preferences, validating the interactive evaluation approach.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Direct Preference Optimization (DPO)
Knowledge of Instruction Tuning (SFT)
Basic familiarity with clinical depression terminology (DSM-V, PHQ-9)

Key Terms

DPO: Direct Preference Optimization—a method to align models to preferences (e.g., 'response A is better than B') without a separate reward model

SFT: Supervised Fine-Tuning—training a model on high-quality input-output pairs (here, profile+context -> response)

Psychological Profile: A structured set of attributes (e.g., 'Gender: Female', 'Symptom: Insomnia', 'Severity: Severe') extracted from clinical literature (DSM-V) to define the simulated patient

Profile Noise Augmentation: A technique where the model generates a 'bad' response by using a slightly altered (noisy) profile, creating a negative sample for DPO training

Cognitive Distortions: Biased ways of thinking common in depression (e.g., catastrophizing), adapted from Beck's theory for the profiles

DSM-V: Diagnostic and Statistical Manual of Mental Disorders—the standard classification of mental disorders used by mental health professionals