Explainable Depression Detection in Clinical Interviews with Personalized Retrieval-Augmented Generation

📝 Paper Summary

Modularized RAG pipeline
Explainable AI for Mental Health

RED uses a personalized retrieval-augmented framework to detect depression in clinical interviews by generating user-specific queries and enhancing evidence with external social intelligence knowledge.

Core Problem

Automated depression detection often relies on black-box neural networks lacking interpretability or post-hoc LLM explanations prone to hallucination, while standard retrieval methods fail to account for the highly personalized nature of patient interviews.

Why it matters:

Clinical interviews are the gold standard for diagnosis but require scarce professional resources, creating a need for automated but transparent systems.
Existing black-box models provide no rationale for their predictions, even though such a rationale is critical in high-stakes mental health contexts.
Generic retrieval queries ignore individual patient backgrounds (e.g., specific symptoms or life events), leading to suboptimal evidence gathering.

Concrete Example: A standard system might ask a generic query about 'sleep issues' for all patients. However, for a patient mentioning 'insomnia due to work stress,' a personalized query tailored to that context would retrieve more relevant dialogue snippets, whereas the generic query might miss nuances or retrieve irrelevant chatter.

Key Novelty

RED (Retrieval-augmented Explainable Depression detection)

Tailors retrieval queries to each patient by first using an LLM to infer a user profile from the dialogue, then generating specific queries for depression symptoms based on that profile.
Enhances LLM reasoning with 'social intelligence' by retrieving relevant psychological concepts from an external knowledge graph (COKE) using event-centric retrieval.
Uses an adaptive judgment module to decide when enough evidence has been collected, stopping retrieval early if sufficient information is found.

Architecture

The overall architecture of the RED framework, illustrating the flow from user profiling to personalized query generation, adaptive retrieval, social intelligence enhancement, and final prediction.

Evaluation Highlights

Outperforms state-of-the-art multimodal baselines (e.g., SEGA) by 4 points of Macro F1 (+0.04) on the DAIC-WoZ benchmark.
Achieves higher precision (+6.0 points) and recall (+3.0 points) for the depressed class compared to the best LLM-based method (Personal RAG).
Ablation studies indicate that the Social Intelligence Enhancement module contributes substantially, improving Macro F1 by approximately 4 points compared to the base model without it.

Breakthrough Assessment

7/10

Strong application of RAG to a sensitive domain with novel personalization and external knowledge integration components. Results are solid, though the scope is limited to one dataset.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of depression status based on clinical interview transcripts.

Inputs: Clinical interview transcript D, PHQ-8 aspect set A

Outputs: Predicted binary depression label ŷ (0 for control, 1 for depressed)
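
For orientation, the binary label in this setting is conventionally derived from the PHQ-8 total score. The sketch below assumes standard PHQ-8 scoring (eight items, each scored 0 to 3) and the commonly used cutoff of 10. The summary does not state which cutoff the paper uses, so `cutoff` is an assumption, and the aspect names are informal paraphrases rather than the paper's aspect set A.

```python
# Informal paraphrases of the eight PHQ-8 items (not the paper's exact wording).
PHQ8_ASPECTS = [
    "little interest or pleasure",
    "feeling down or hopeless",
    "sleep problems",
    "tiredness or low energy",
    "appetite changes",
    "feeling bad about oneself",
    "trouble concentrating",
    "psychomotor slowing or restlessness",
]


def phq8_label(item_scores, cutoff=10):
    """Map eight item scores (each 0-3) to a label: 1 = depressed, 0 = control.

    The cutoff of 10 is an assumed default, not a value taken from the paper.
    """
    if len(item_scores) != len(PHQ8_ASPECTS):
        raise ValueError("expected one score per PHQ-8 item")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item score must be 0, 1, 2, or 3")
    return 1 if sum(item_scores) >= cutoff else 0
```

RED itself predicts the label from transcript evidence rather than from questionnaire answers; this helper only illustrates where a ground-truth label of this kind typically comes from.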

Pipeline Flow

User Profiling: Transcript → User Profile
Personal Query Generation: Basic Queries + Profile → Personal Queries
Adaptive Retrieval: Personal Queries + Transcript → Evidence Set (with Stop Signal)
Social Intelligence Enhancement: Evidence → Event Extraction → Knowledge Retrieval → Enhanced Evidence
Prediction: Enhanced Evidence → LLM → Diagnosis
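
The five stages above can be sketched as a chain of pluggable callables. This is a structural illustration only: the class name `RedPipeline`, the field names, and the trivial stand-in functions are all hypothetical, and in the paper each stage is backed by an LLM or an embedding model rather than the toy lambdas used here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RedPipeline:
    """Hypothetical wiring of the five stages; every stage is injected."""
    profile_user: Callable[[str], str]           # transcript -> profile
    personalize: Callable[[str, str], str]       # (basic query, profile) -> personal query
    retrieve: Callable[[str, str], List[str]]    # (personal query, transcript) -> evidence
    enhance: Callable[[List[str]], List[str]]    # evidence -> enhanced evidence
    predict: Callable[[List[str]], int]          # enhanced evidence -> 0 or 1

    def run(self, transcript: str, basic_queries: List[str]) -> int:
        profile = self.profile_user(transcript)
        evidence: List[str] = []
        for query in basic_queries:
            personal_query = self.personalize(query, profile)
            evidence.extend(self.retrieve(personal_query, transcript))
        return self.predict(self.enhance(evidence))


# Toy stand-ins so the sketch runs end to end without any model calls.
pipeline = RedPipeline(
    profile_user=lambda t: "works night shifts",
    personalize=lambda q, p: f"{q} (context: {p})",
    retrieve=lambda q, t: [line for line in t.splitlines() if "sleep" in line],
    enhance=lambda ev: ev + ["background knowledge placeholder"],
    predict=lambda ev: int(len(ev) > 1),
)
label = pipeline.run("I barely sleep lately\nI moved last year", ["sleep problems"])
```

Keeping each stage behind a plain function signature makes it straightforward to swap one component (for example, the retriever) while holding the rest fixed, which is how the ablations described later would typically be run.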

System Modules

User Profiling Agent (Input Processing)

Summarize the user's background and context from the raw transcript to enable personalization.

Model or implementation: gpt-4o-2024-08-06

Personal Query Generator (Input Processing)

Rewrite standard PHQ-8 queries into personalized queries based on the user profile.

Model or implementation: GPT model (exact variant not stated; presumably from the GPT-4 family)
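
A minimal sketch of how a basic query and a user profile might be combined into a rewriting instruction for the LLM. The prompt wording and the function name `build_personalization_prompt` are hypothetical; the paper's actual prompts are in its Appendix A.1 and are not reproduced here.

```python
def build_personalization_prompt(basic_query: str, profile: str) -> str:
    """Assemble an instruction asking an LLM to tailor a generic query.

    The wording is illustrative only, not the paper's prompt.
    """
    return (
        "You are assisting with the analysis of a clinical interview.\n"
        f"Patient profile: {profile}\n"
        f"Generic query: {basic_query}\n"
        "Rewrite the generic query so that it reflects this patient's "
        "background while keeping the same symptom focus. "
        "Return only the rewritten query."
    )


prompt = build_personalization_prompt(
    "Does the participant report sleep problems?",
    "Reports insomnia linked to work stress.",
)
```

The resulting string would then be sent to the chat model of choice; the model call itself is omitted because the summary does not specify its parameters.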

Adaptive Retriever (Retrieval & Selection)

Retrieve dialogue snippets relevant to each personalized query until a judge signals to stop.

Model or implementation: text-embedding-3-large (Dense Retriever) + LLM Judge Agent

Social Intelligence Enhancer (Retrieval & Selection)

Augment dialogue evidence with psychological knowledge by retrieving relevant chains from the COKE knowledge graph.

Model or implementation: LLM Event Extractor + MORE-CL Event Encoder
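
Event-centric matching against a knowledge base can be illustrated with a toy example. Everything here is a simplification under stated assumptions: the knowledge entries are invented for illustration and are not taken from COKE, the event triplet is supplied by hand rather than extracted by an LLM, and token-overlap (Jaccard) similarity stands in for the MORE-CL event encoder.

```python
def tokens(triplet):
    """Lower-cased word set of a (subject, predicate, object) triplet."""
    return {word for part in triplet for word in part.lower().split()}


def jaccard(a, b):
    """Jaccard overlap of two token sets (stand-in for encoder similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented entries for illustration only; NOT actual COKE content.
TOY_KB = [
    (("person", "loses", "job"), "may think they have failed; may feel anxious"),
    (("person", "argues with", "friend"), "may think they are disliked; may feel lonely"),
    (("person", "wins", "award"), "may think effort paid off; may feel proud"),
]


def retrieve_knowledge(event, kb=TOY_KB, top_k=1):
    """Return the knowledge text of the top_k entries most similar to event."""
    event_tokens = tokens(event)
    ranked = sorted(
        kb, key=lambda entry: jaccard(event_tokens, tokens(entry[0])), reverse=True
    )
    return [text for _, text in ranked[:top_k]]


knowledge = retrieve_knowledge(("patient", "lost", "job"))
```

The retrieved text would be appended to the dialogue evidence before the final prediction step. A learned encoder would also match paraphrases such as "lost" and "loses", which this token-overlap stand-in does not.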

Diagnosis Generator

Predict depression status based on the enhanced evidence.

Model or implementation: LLM (gpt-4o, gpt-4o-mini, or gpt-4)

Novel Architectural Elements

Personalized Query Generation Module: Modifies retrieval queries based on an inferred user profile before retrieval.
Social Intelligence Enhancement Module: A secondary retrieval step using event-centric matching against a theory-of-mind knowledge graph (COKE) to augment primary evidence.

Modeling

Base Model: GPT-4o / GPT-4 family (API-based)

Compute: Single NVIDIA GeForce RTX 3090

Comparison to Prior Work

vs. Naive RAG: RED adds user profiling for query customization and an adaptive stop mechanism.
vs. Personal RAG: RED adds the social intelligence enhancement module using the COKE knowledge base.
vs. SEGA: RED is an explainable retrieval-based framework rather than a graph neural network approach.

Limitations

Relies on closed-source commercial LLMs (the GPT-4 family), limiting reproducibility and cost-effectiveness.
Evaluation is limited to a single dataset (DAIC-WoZ) due to scarcity of clinical interview data.
The retrieval stop mechanism and event extraction rely on LLM calls, increasing inference latency and cost.

Reproducibility

No code URL provided. The method relies on closed-source OpenAI models (gpt-4o, text-embedding-3-large). Prompts are provided in Appendix A.1. The DAIC-WoZ dataset requires access permission.

📊 Experiments & Results

Evaluation Setup

Binary classification on the DAIC-WoZ dataset (Depressed vs. Control).

Benchmarks:

DAIC-WoZ (Depression Detection)

Metrics:

Precision (Depressed/Control)
Recall (Depressed/Control)
F1 score (Depressed/Control)
Macro F1

Statistical methodology: Average scores of 3 runs reported. No significance tests explicitly reported.
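
For reference, the metrics listed above can be computed as in the following sketch. It uses the standard definitions of per-class precision, recall, and F1, with Macro F1 as the unweighted mean of the per-class F1 scores; the function names and the toy labels are illustrative.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1, so both classes count equally."""
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Toy labels: 1 = depressed, 0 = control.
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]
score = macro_f1(y_true, y_pred)
```

As a sanity check on the reported numbers, the mean of the Depressed F1 (0.80) and Control F1 (0.89) is 0.845, which agrees with the reported Macro F1 of 0.84 to within rounding of the per-class values.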

Key Results

RED outperforms both neural network baselines and LLM-based approaches on the DAIC-WoZ dataset.

Benchmark	Metric	Baseline	This Paper	Δ
DAIC-WoZ	Macro F1	0.80	0.84	+0.04
DAIC-WoZ	Depressed F1	0.75	0.80	+0.05
DAIC-WoZ	Control F1	0.85	0.89	+0.04

Ablation studies demonstrate the contribution of each module.

Benchmark	Metric	Baseline	This Paper	Δ
DAIC-WoZ	Macro F1	0.80	0.84	+0.04
DAIC-WoZ	Macro F1	0.76	0.84	+0.08

Main Takeaways

Personalization is critical: Tailoring queries to user profiles yields a large gain over generic queries (+0.08 Macro F1 vs. Naive RAG).
Social Intelligence matters: Augmenting LLMs with external theory-of-mind knowledge (COKE) provides a further performance boost (+0.04 Macro F1), helping the model interpret social cues.
RED achieves state-of-the-art results on DAIC-WoZ, surpassing complex multimodal neural networks (like SEGA) using only text transcripts and RAG.
The framework offers interpretability by providing the specific retrieved dialogue snippets and knowledge base entries used to make the diagnosis.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with depression screening instruments (PHQ-8)
Basic knowledge of Large Language Models (LLMs) and prompting

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then conditions its generated answer on them

PHQ-8: Patient Health Questionnaire-8, a standard clinical questionnaire used to screen for depression by assessing eight specific symptoms

DAIC-WoZ: Distress Analysis Interview Corpus-Wizard of Oz, a widely used dataset containing clinical interviews for depression detection

LLM: Large Language Model, a deep learning model that can recognize, summarize, translate, predict, and generate text

COKE: A cognitive knowledge graph for machine theory of mind, used here to provide social intelligence context

Macro F1: An evaluation metric that computes the F1 score (harmonic mean of precision and recall) separately for each class and then takes the unweighted average, so that all classes count equally regardless of size

Event-centric retrieval: Extracting 'event triplets' (subject, predicate, object) from text to use as search queries against a knowledge base

Dense retriever: A retrieval method that uses vector embeddings to find semantically similar text rather than just keyword matching