Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents

📝 Paper Summary

Commonsense Reasoning in Dialogue | Knowledge Distillation | Chain-of-Thought Reasoning

The paper proposes a framework that distills multi-hop commonsense reasoning from an unreliable Large Language Model into a smaller, efficient rationale generator for dialogue agents, using alignment filters to discard inconsistent or unhelpful rationales.

Core Problem

Existing dialogue agents struggle with commonsense reasoning because key evidence is implicitly scattered across multiple turns, requiring multi-hop reasoning that single-hop knowledge models miss.

Why it matters:

Current methods like COMET generate single-hop inferences that fail to integrate scattered context, leading to dull or incoherent responses
Simply prompting LLMs for reasoning is unreliable due to hallucinations (unfaithfulness) and generation of irrelevant rationales (unhelpfulness)

Concrete Example: In a dialogue about a lost handbag, implicit clues appear across turns ('lunch', 'restaurant', 'didn't find it there'). A single-hop model vaguely suggests 'Call the police', whereas the proposed multi-hop reasoner correctly infers 'You should go to the restaurant where you had lunch'.

Key Novelty

Dialogue Chain-of-Thought (CoT) Distillation & DOCTOR

Formulates dialogue reasoning as a multi-step QA process (Dialogue CoT) to uncover implicit evidence, rather than a single generation step
Uses an 'unreliable' LLM (ChatGPT) to generate candidates, then cleans them with two alignment filters: one for consistency (detecting hallucinations) and one for helpfulness (relevance to response)
Distills this filtered data into a smaller, specialized model (DOCTOR) that generates superior rationales compared to the original LLM prompting

Architecture

The distillation framework: (1) Rationale Generation using LLM, (2) Alignment Filtering (Critic + Helpfulness), and (3) Training the DOCTOR model.

Evaluation Highlights

ChatGPT augmented with DOCTOR achieves +2.36 BLEU-1 and +1.26 BLEU-2 over vanilla ChatGPT on the DailyDialog dataset
Human judges prefer responses generated using DOCTOR over those using 'Self-CoT' (ChatGPT prompting itself) 67% of the time for naturalness
DOCTOR generalizes well: on the out-of-domain Reflect-9K dataset, it outperforms the 'Reflect' baseline, which was trained specifically on that domain, by 0.30 BLEU-1

Breakthrough Assessment

8/10

Significant contribution in defining 'Dialogue CoT' and showing that a small model distilled from filtered data can produce better rationales than its large teacher model. The released DONUT dataset is a major resource.

⚙️ Technical Details

Problem Definition

Setting: Response generation conditioned on dialogue history and generated commonsense rationales

Inputs: Dialogue context U (sequence of utterances)

Outputs: Next response u_t, conditioned on a generated rationale Z (sequence of QA pairs)
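
To make the input and output format concrete, the sketch below represents a rationale Z as a list of question-answer pairs and serializes the dialogue context together with the rationale into one text sequence. The names `QAPair` and `format_example`, the speaker labels, and the `Q`/`A` layout are illustrative assumptions, not the paper's actual template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    """One reasoning hop: a question about the dialogue and its answer."""
    question: str
    answer: str


def format_example(context: List[str], rationale: List[QAPair]) -> str:
    """Serialize dialogue context U and rationale Z into one text sequence.

    The layout is hypothetical; the paper's real prompt format may differ.
    """
    # Alternate speaker labels A/B, assuming a two-person dialogue.
    turns = [f"{'AB'[i % 2]}: {utt}" for i, utt in enumerate(context)]
    # Number each question-answer hop so the chain order stays explicit.
    hops = [f"Q{i + 1}: {qa.question}\nA{i + 1}: {qa.answer}"
            for i, qa in enumerate(rationale)]
    return "\n".join(turns) + "\n\nRationale:\n" + "\n".join(hops)


# Invented toy data, loosely modeled on the lost-handbag example.
context = ["I can't find my handbag.", "Where did you have lunch?"]
rationale = [QAPair("Where was the speaker earlier?", "At a restaurant.")]
print(format_example(context, rationale))
```

Keeping the rationale as structured pairs rather than free text makes it easy to drop or reorder individual hops before serialization.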

Pipeline Flow

Rationale Candidate Generation (ChatGPT + ATOMIC relations)
Rationale-to-Context Filtering (Critic Model)
Rationale-to-Response Filtering (Helpfulness Check)
Training DOCTOR (Fine-tuning OPT on filtered data)
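
The first three stages above can be sketched as a generate-then-filter loop whose output feeds the fine-tuning stage. This is a minimal sketch under stated assumptions: `build_training_set` and its three callables are hypothetical stand-ins for the teacher LLM, the critic model, and the helpfulness check, and rationales are treated as plain strings.

```python
from typing import Callable, Iterable, List, Tuple

# A dialogue is (context utterances, ground-truth next response).
Dialogue = Tuple[List[str], str]


def build_training_set(
    dialogues: Iterable[Dialogue],
    generate: Callable[[List[str]], List[str]],
    is_consistent: Callable[[List[str], str], bool],
    is_helpful: Callable[[List[str], str, str], bool],
) -> List[Tuple[List[str], str]]:
    """Run stages 1-3 and return (context, rationale) pairs for fine-tuning."""
    kept = []
    for context, response in dialogues:
        # Stage 1: the teacher proposes candidate rationales.
        for rationale in generate(context):
            # Stage 2: drop rationales that contradict the context.
            if not is_consistent(context, rationale):
                continue
            # Stage 3: drop rationales that do not help predict the response.
            if not is_helpful(context, rationale, response):
                continue
            kept.append((context, rationale))
    return kept
```

Passing the filters in as callables keeps the loop independent of any particular model, so each stage can be stubbed out or swapped during experimentation.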

System Modules

Rationale Generator (Teacher)

Generate candidate QA pairs (rationales) for dialogue contexts using ATOMIC relation types as prompts

Model or implementation: ChatGPT (gpt-3.5-turbo)

Critic Model (Filter 1) (Data Filtering)

Filter out rationales that are factually inconsistent (hallucinated) with the context

Model or implementation: RoBERTa-large

Helpfulness Filter (Filter 2) (Data Filtering)

Filter out rationales that do not increase the probability of the ground-truth response

Model or implementation: Cosmo-3B (Pre-trained Dialogue Model)
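
One way to read this filter is as a likelihood-ratio test on the ground-truth response. The sketch below assumes the dialogue model has already produced per-token log-probabilities of the response with and without the rationale. The function names, the ratio form of the test, and the use of the `filter_threshold_tau` value (0.95) as this filter's threshold are assumptions made for illustration; the exact criterion, including any length normalization, may differ.

```python
import math
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Sum per-token log-probabilities into a sequence log-probability."""
    return sum(token_log_probs)


def is_helpful(
    logp_with_rationale: Sequence[float],
    logp_without_rationale: Sequence[float],
    tau: float = 0.95,
) -> bool:
    """Keep a rationale if P(resp | ctx, rationale) / P(resp | ctx) > tau.

    The comparison is done in log space to avoid underflow on long responses.
    """
    log_ratio = (sequence_log_prob(logp_with_rationale)
                 - sequence_log_prob(logp_without_rationale))
    return log_ratio > math.log(tau)


# Toy numbers: the rationale raises the response likelihood, so it is kept.
print(is_helpful([-1.0, -1.0], [-2.0, -2.0]))
```

Note that with a threshold below 1, this form of the test also keeps rationales that leave the response probability roughly unchanged.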

DOCTOR (Student)

Generate reliable Dialogue CoT rationales for unseen dialogues

Model or implementation: OPT-1.3B

Novel Architectural Elements

Dual-filter distillation pipeline: Specifically combining a RoBERTa-based consistency critic with a perplexity-based helpfulness filter to clean CoT data

Modeling

Base Model: OPT-1.3B

Training Method: Supervised Fine-Tuning (Distillation)

Objective Functions:

Purpose: Minimize negative log-likelihood of the rationale sequence.

Formally: Standard Causal Language Modeling (Next Token Prediction) on the sequence [U_t; Z].
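
Written out, this objective takes the usual token-level form. The notation is an assumption layered on the symbols above: U_t is the dialogue context, Z = (z_1, ..., z_{|Z|}) is the rationale token sequence, and θ denotes the student model's parameters. The sum is shown over rationale tokens only, matching the stated purpose; whether context tokens also contribute to the loss is not specified here.

```latex
\mathcal{L}(\theta) = -\sum_{i=1}^{|Z|} \log P_\theta\left(z_i \mid U_t, z_{<i}\right)
```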

Training Data:

DONUT dataset (10K dialogues from DailyDialog, DREAM, MuTual, SODA)
Annotated with 367K CoT rationales after filtering

Key Hyperparameters:

learning_rate: 5e-4
epochs: 5
batch_size: Not explicitly reported in the paper
filter_threshold_tau: 0.95

Compute: ChatGPT annotation cost of about 1,200 USD; annotation inference takes 0.1 seconds per sample.

Comparison to Prior Work

vs. COMET/DIALeCT: DOCTOR generates multi-hop rationales (QA chains) rather than single triplets, capturing implicit context better
vs. Self-CoT: DOCTOR is trained on filtered, high-quality rationales, whereas Self-CoT uses raw LLM outputs which are often inconsistent or unhelpful
vs. Scott [not cited in paper]: Scott distills CoT for QA; this paper distills CoT specifically for dialogue response generation with dialogue-specific filters

Limitations

Experiments limited to open-domain, dyadic (two-person) dialogues
Rationale annotations are fully machine-generated (ChatGPT), creating potential risk if filters fail
Requires an additional inference step (rationale generation) before response generation, adding latency
Depends on the quality of the 'unreliable teacher' (ChatGPT) to generate at least some valid candidates

Reproducibility

Code: https://github.com/kyle8581/DialogueCoT

Code and dataset (DONUT) are publicly available on GitHub and HuggingFace. The logic for generating the critic model's training data is described. Specific prompt templates for ChatGPT are provided in the Appendix.

📊 Experiments & Results

Evaluation Setup

Dialogue Response Generation augmented with commonsense knowledge

Benchmarks:

DailyDialog (Open-domain dialogue)
DREAM (Dialogue reading comprehension, adapted for generation)
MuTual (Multi-turn dialogue reasoning)
Reflect-9K (Dialogue common ground, out-of-domain test)

Metrics:

BLEU-1, BLEU-2, BLEU-4
ROUGE-L
Human Evaluation (Naturalness, Consistency, Specificity, Engagement/Helpfulness)
Statistical methodology: t-test (p < 0.05) used for human evaluation significance

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Automatic evaluation on DailyDialog shows DOCTOR significantly improves performance of large models (ChatGPT) and specialized models (Cosmo-3B).
DailyDialog	BLEU-1	17.25	19.61	+2.36
DailyDialog	BLEU-1	18.16	19.61	+1.45
DailyDialog	ROUGE-L	13.69	14.68	+0.99
Out-of-domain evaluation on Reflect-9K shows generalization capabilities.
Reflect-9K	BLEU-1	18.24	18.54	+0.30
Human evaluation confirms qualitative superiority of DOCTOR rationales.
DailyDialog	Naturalness (Win Rate)	33	67	+34

Experiment Figures

Head-to-head human evaluation comparing DONUT dataset quality against human-authored datasets (CICERO, Reflect).

Analysis of rationale alignment and response coherence.

Main Takeaways

Distilling LLM reasoning capabilities via rigorous filtering produces a student model (DOCTOR) that is more reliable than the teacher LLM (Self-CoT).
Multi-hop reasoning (Dialogue CoT) is superior to single-hop commonsense methods (COMET, DIALeCT) for capturing implicit dialogue context.
The 'Question' part of the QA rationale is crucial; ablations show that generating answers alone (without questions) significantly degrades response quality.
The alignment filters are essential: removing either the consistency filter or the helpfulness filter leads to significant drops in human preference win rates.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Distillation
Chain-of-Thought (CoT) Prompting
Commonsense Knowledge Graphs (e.g., ATOMIC)
Causal Language Modeling

Key Terms

Dialogue CoT: Dialogue Chain-of-Thought—a reasoning format decomposing commonsense inference into a sequence of question-answer pairs derived from conversation history

DOCTOR: DialOgue Chain-of-ThOught Reasoner—the specific model (based on OPT-1.3B) trained in this paper to generate rationales

DONUT: DialOgue chaiN-of-thoUght dataseT—the large-scale dataset of 10K dialogues with filtered, high-quality CoT annotations constructed in this paper

Alignment Filters: Mechanisms used to select high-quality rationales; specifically checking for consistency (not hallucinating) and helpfulness (improving response probability)

COMET: Commonsense Transformers—a standard baseline model that generates single-hop commonsense inferences

ATOMIC: A large-scale atlas of everyday commonsense reasoning (If X happens, X likely wants/needs/feels Y)

Self-CoT: A baseline method where the LLM (ChatGPT) prompts itself to generate a rationale before answering, without the proposed filtering/distillation

OPT-1.3B: Open Pre-trained Transformer—a decoder-only language model used as the backbone for the DOCTOR model

Cosmo-3B: A dialogue-specialized language model used here as a base agent to evaluate the helpfulness of generated rationales