Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models

📝 Paper Summary

Mental Health Analysis · Chain-of-Thought Reasoning · Clinical NLP

A structured Chain-of-Thought prompting framework that guides Large Language Models through emotion analysis, binary classification, and causal reasoning before estimating depression severity scores.

Core Problem

Standard LLM-based depression detection lacks interpretability, struggles to identify nuanced linguistic cues such as anhedonia expressed through negated positive affect, and often conflates symptom identification with severity assessment.

Why it matters:

Current end-to-end classification models fail to meet medical standards for auditability because they lack explicit reasoning steps.
Holistic text processing overlooks subtle diagnostic markers, such as implied expectations in phrases like 'Family gatherings should be fun, yet they feel meaningless.'
Clinical practice requires distinct processes for symptom detection (binary) and severity assessment (regression), which current models often merge.

Concrete Example: A phrase like 'I try to enjoy hobbies but feel nothing' contains subtle anhedonia cues (an implicit expectation of joy plus contextual negation). A standard model might miss these cues or treat the phrase as generically negative, whereas the proposed method explicitly dissects the negated positive affect before assigning a depression score.

Key Novelty

Emotion-to-Reasoning Chain-of-Thought (CoT) Framework

Decomposes the diagnostic process into four clinical stages: Emotion Analysis, Binary Classification, Causal Reasoning, and Severity Assessment.
Mimics the workflow of mental health professionals by first identifying symptoms and root causes before calculating quantitative severity scores (PHQ-8).

Architecture

The four-stage Chain-of-Thought prompting framework pipeline.

Evaluation Highlights

Achieves 0.732 CCC (Concordance Correlation Coefficient) with GPT-4o using the proposed CoT strategy, outperforming the standard GPT-4o baseline (0.696 CCC).
Substantially improves QwQ-32b-preview's performance, raising CCC from 0.597 to 0.705 and reducing Mean Absolute Error (MAE) from 4.23 to 3.55.
Surpasses traditional multimodal deep learning baselines like CubeMLP (0.583 CCC) using only text input.

Breakthrough Assessment

7/10

Significant improvement in interpretability and accuracy for mental health estimation by aligning LLM reasoning with clinical workflows. Demonstrates that structured prompting can unlock performance gains even in smaller models.

⚙️ Technical Details

Problem Definition

Setting: Regression task to predict depression severity scores based on clinical interview transcripts.

Inputs: Text transcript s of a clinical interview.

Outputs: Estimated PHQ-8 score (0-24) and severity category (e.g., Minimal, Severe).
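The relationship between the numeric output and the severity category can be sketched as a simple lookup. The band boundaries below are the conventional PHQ-8 cut points and the cutoff of 10 is the commonly used screening threshold; both are assumptions here, since this summary does not state which labels or thresholds the paper uses. The names `PHQ8_BANDS`, `phq8_category`, and `is_depressed` are hypothetical.

```python
# Conventional PHQ-8 severity bands (assumed; the paper's exact labels may differ).
PHQ8_BANDS = [
    (0, 4, "Minimal"),
    (5, 9, "Mild"),
    (10, 14, "Moderate"),
    (15, 19, "Moderately severe"),
    (20, 24, "Severe"),
]


def phq8_category(score: int) -> str:
    """Map a total PHQ-8 score (0-24) to a severity label."""
    if not 0 <= score <= 24:
        raise ValueError(f"PHQ-8 total must be in 0..24, got {score}")
    for low, high, label in PHQ8_BANDS:
        if low <= score <= high:
            return label
    raise AssertionError("unreachable: bands cover 0..24")


def is_depressed(score: int, cutoff: int = 10) -> bool:
    """Binary label from a total score, using an assumed cutoff of 10."""
    return score >= cutoff


if __name__ == "__main__":
    for s in (3, 12, 21):
        print(s, phq8_category(s), is_depressed(s))
```

Keeping the bands in a table rather than an if/elif ladder makes it easy to swap in different labels if the paper's categories turn out to differ.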

Pipeline Flow

Group: Structured Reasoning Prompting
Stage 1: Emotion Analysis -> Stage 2: Binary Classification -> Stage 3: Reasoning Analysis -> Stage 4: Severity Assessment

System Modules

Stage 1: Emotion Analysis (Structured Reasoning Prompting)

Extract detailed emotional signals including type, intensity, polarity, and source, specifically targeting negated positives (anhedonia).

Model or implementation: LLM (Inference only)
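One way to make the Stage 1 output usable by later stages is to ask the model for structured records and parse them. The sketch below assumes the LLM is instructed to return a JSON list; the field names, the 0 to 1 intensity scale, and the `negated_positive` flag are illustrative choices, not the paper's schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class EmotionSignal:
    """One emotional signal extracted from a transcript (hypothetical schema)."""
    emotion_type: str               # e.g. "joy", "sadness"
    intensity: float                # assumed scale: 0.0 (absent) to 1.0 (strong)
    polarity: str                   # "positive", "negative", or "neutral"
    source: str                     # what the emotion is about
    negated_positive: bool = False  # expected positive affect that is denied


def parse_emotion_signals(raw: str) -> List[EmotionSignal]:
    """Parse a JSON list of signal objects, as the LLM is assumed to emit."""
    return [EmotionSignal(**item) for item in json.loads(raw)]


def anhedonia_cues(signals: List[EmotionSignal]) -> List[EmotionSignal]:
    """Keep only the signals flagged as negated positive affect."""
    return [s for s in signals if s.negated_positive]


if __name__ == "__main__":
    # Hand-written example for 'I try to enjoy hobbies but feel nothing'.
    raw = json.dumps([
        {"emotion_type": "joy", "intensity": 0.1, "polarity": "negative",
         "source": "hobbies", "negated_positive": True},
        {"emotion_type": "fatigue", "intensity": 0.6, "polarity": "negative",
         "source": "work"},
    ])
    for cue in anhedonia_cues(parse_emotion_signals(raw)):
        print(cue.source, cue.emotion_type)
```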

Stage 2: Binary Classification (Structured Reasoning Prompting)

Determine if the individual is 'Depressed' or 'Not Depressed' based on PHQ-8 guidelines.

Model or implementation: LLM (Inference only)

Stage 3: Reasoning Analysis (Structured Reasoning Prompting)

Identify underlying causes (if Depressed) or protective factors (if Not Depressed) across social, biological, and psychological dimensions.

Model or implementation: LLM (Inference only)

Stage 4: Severity Assessment (Structured Reasoning Prompting)

Calculate the final PHQ-8 score and severity category using insights from all previous stages.

Model or implementation: LLM (Inference only)
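The four stages above can be sketched as a chain of calls in which each prompt receives the outputs of earlier stages. This is a minimal illustration, not the paper's implementation: the prompt wording is invented, `llm` is a hypothetical callable that takes a prompt string and returns the model's text, and whether the paper issues four separate calls or one combined prompt is not stated in this summary.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def run_pipeline(transcript: str, llm: LLM) -> Dict[str, str]:
    """Run the four stages in order, feeding each result into the next prompt."""
    emotions = llm(
        "Stage 1 - Emotion Analysis. List each emotional signal with its type, "
        "intensity, polarity, and source. Flag negated positive affect.\n\n"
        f"Transcript:\n{transcript}"
    )
    label = llm(
        "Stage 2 - Binary Classification. Following PHQ-8 guidelines, answer "
        "with exactly 'Depressed' or 'Not Depressed'.\n\n"
        f"Transcript:\n{transcript}\n\nEmotion analysis:\n{emotions}"
    ).strip()
    # Stage 3 branches on the Stage 2 label, as described in the module list.
    if label.lower().startswith("not"):
        focus = "protective factors"
    else:
        focus = "underlying causes"
    reasoning = llm(
        f"Stage 3 - Reasoning Analysis. Identify {focus} across social, "
        "biological, and psychological dimensions.\n\n"
        f"Transcript:\n{transcript}\n\nEmotion analysis:\n{emotions}\n\n"
        f"Classification: {label}"
    )
    severity = llm(
        "Stage 4 - Severity Assessment. Give a PHQ-8 score (0-24) and a "
        "severity category.\n\n"
        f"Emotion analysis:\n{emotions}\n\nClassification: {label}\n\n"
        f"Reasoning:\n{reasoning}"
    )
    return {"emotions": emotions, "label": label,
            "reasoning": reasoning, "severity": severity}
```

Because `llm` is just a callable, the same function can wrap any provider's client or a stub for testing. Returning every intermediate output, rather than only the final score, is what makes the run auditable.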

Novel Architectural Elements

Four-stage sequential prompting pipeline explicitly modeled on clinical diagnostic workflows (Symptom -> Diagnosis -> Etiology -> Severity)

Modeling

Base Model: Various LLMs evaluated: GPT-4o, Qwen2.5-Max, DeepSeek V3, GPT-o1-preview, DeepSeek-R1, QwQ-32b-preview, GPT-o3-mini

Compute: Not reported in the paper

Comparison to Prior Work

vs. CubeMLP/MIMRL: Uses text-only LLM reasoning rather than multimodal fusion; explicitly generates interpretable clinical reasoning steps instead of opaque feature vectors.
vs. Standard LLMs (Direct Prompting): Decomposes the task into four clinically aligned stages rather than asking for a direct score prediction.

Limitations

Relies solely on text modality, ignoring acoustic and visual cues present in the dataset.
Performance depends on the underlying LLM's reasoning capability; smaller models may still struggle.
Effectiveness relies on the accuracy of the initial emotion analysis stage; errors there propagate downstream.

Reproducibility

No replication artifacts mentioned in the paper (no code URL, prompt templates, or API scripts provided). The method relies on prompt engineering logic described in the text.

📊 Experiments & Results

Evaluation Setup

Depression severity estimation using the E-DAIC dataset (text modality only).

Benchmarks:

E-DAIC (Depression Severity Regression)

Metrics:

Concordance Correlation Coefficient (CCC)
Mean Absolute Error (MAE)
Statistical methodology: Not explicitly reported in the paper
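For readers unfamiliar with the two metrics, here is a small dependency-free sketch. It uses Lin's definition of CCC, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population (1/N) moments; whether the paper uses population or sample moments is not stated in this summary, so treat that choice as an assumption.

```python
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error: average of |true - predicted|."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def ccc(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Concordance Correlation Coefficient using population (1/N) moments."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov / denom


if __name__ == "__main__":
    truth = [0, 2, 4]
    shifted = [1, 3, 5]  # perfectly correlated, but offset by one point
    print(mae(truth, shifted), ccc(truth, shifted))
```

The example shows why CCC is stricter than Pearson correlation: predictions shifted by a constant one point still correlate perfectly with the truth, yet their CCC is 16/19 (about 0.842) because the mean offset is penalized.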

Key Results

Comparison	Benchmark	Metric	Baseline	This Paper	Δ
Comparison with traditional multimodal baselines, showing that LLMs with CoT outperform deep learning fusion methods.
CubeMLP vs. GPT-4o with CoT	E-DAIC	CCC	0.583	0.732	+0.149
Ablation studies demonstrating the impact of the proposed CoT prompting strategy on standard LLMs (without inherent CoT).
Model not named	E-DAIC	CCC	0.550	0.637	+0.087
Model not named	E-DAIC	MAE	4.33	4.07	-0.26
GPT-4o, without vs. with CoT	E-DAIC	CCC	0.696	0.732	+0.036
Ablation studies showing that even models with inherent CoT capabilities benefit from the specific Emotion-to-Reasoning structured framework.
QwQ-32b-preview, without vs. with CoT	E-DAIC	CCC	0.597	0.705	+0.108
QwQ-32b-preview, without vs. with CoT	E-DAIC	MAE	4.23	3.55	-0.68
Model not named	E-DAIC	CCC	0.625	0.677	+0.052

Main Takeaways

Structured CoT prompting consistently improves depression detection performance across both standard LLMs and reasoning-enhanced LLMs.
The 'Emotion-to-Reasoning' framework effectively bridges the gap between raw text processing and clinical diagnostic standards.
LLMs using this text-only strategy can outperform complex multimodal systems that use audio and video, highlighting the density of diagnostic information in linguistic cues.
The method improves auditability by generating explicit lists of depressive factors (social, biological, psychological) alongside the score.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Chain-of-Thought (CoT) prompting
Familiarity with depression diagnostic criteria (PHQ-8)
Basics of regression evaluation metrics (CCC, MAE)

Key Terms

PHQ-8: Patient Health Questionnaire-8, a standardized clinical tool used to monitor the severity of depression. Its eight items are each scored from 0 to 3, giving a total score from 0 to 24.

CCC: Concordance Correlation Coefficient—a statistic that measures agreement between two variables, evaluating both correlation and deviation from the 45-degree line (perfect agreement).

CoT: Chain-of-Thought—a prompting technique that encourages LLMs to generate intermediate reasoning steps before producing a final answer.

Anhedonia: A core symptom of depression defined as the inability to feel pleasure, often expressed linguistically through negated positive expectations.

E-DAIC: Extended Distress Analysis Interview Corpus—a dataset of clinical interviews used for benchmarking depression detection systems.

MAE: Mean Absolute Error—a metric quantifying the average absolute difference between predicted and actual values.