Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models

📝 Paper Summary

Diffusion Language Models, Chain-of-Thought Reasoning, Non-autoregressive Text Generation

Diffusion-of-Thought enables diffusion language models to perform complex multi-step reasoning by treating reasoning steps as latent variables that diffuse over time, allowing for flexible computation trade-offs and self-correction.

Core Problem

Autoregressive Chain-of-Thought (CoT) suffers from error accumulation (left-to-right bias) where early mistakes propagate to final answers, and it lacks flexibility to trade computation for performance dynamically.

Why it matters:

Errors in early reasoning steps of autoregressive models often lead to irreversible failures in the final answer.
Autoregressive models have a fixed computational cost per token, whereas difficult problems might benefit from flexible 'thinking time' without generating more text.
Existing pre-trained diffusion language models (like Plaid, SEDD) lag behind autoregressive models in complex reasoning capabilities.

Concrete Example: In a math problem, an autoregressive model might generate '2*3=4' early on. Because it generates token-by-token left-to-right, it is forced to condition on this error for all subsequent steps, leading to a wrong answer. DoT can correct '<2*3=4>' to '<2*3=6>' in later diffusion timesteps before finalizing the output.

Key Novelty

Reasoning as a Denoising Process

Treats intermediate reasoning steps (thoughts) as latent variables that are gradually denoised from random noise alongside the final answer.
Allows the model to 'think' in parallel and revise earlier parts of the reasoning chain during the generation process (self-correction), unlike the rigid left-to-right generation of autoregressive models.
Introduces a Multi-Pass variant (DoTMP) that generates one thought at a time to combine the benefits of diffusion flexibility with causal inductive bias.

Architecture

Conceptual comparison of Answer-only, CoT, Implicit CoT, and Diffusion-of-Thought (DoT).

Evaluation Highlights

A small diffusion model (SEDD-medium, 424M) outperforms the 1.8x larger GPT-2 Large (774M) on Grade School Math (GSM8K) by 8.7 percentage points (53.5% vs 44.8%).
Achieves up to a 27x speed-up over autoregressive CoT on simple digit multiplication tasks without a drop in accuracy.
Self-consistency decoding boosts SEDD-medium performance on GSM8K from 53.5% to 59.4%.

Breakthrough Assessment

7/10

First successful application of CoT to pre-trained diffusion LMs with competitive results against AR baselines. While absolute performance is below SOTA LLMs, it proves diffusion models can reason and self-correct.

⚙️ Technical Details

Problem Definition

Setting: Conditional text generation where problem statement 's' is the condition and rationales 'r' plus answer 'a' are generated via reverse diffusion.

Inputs: Problem statement s

Outputs: Chain of rationales r_1, ..., r_n and final answer a

Pipeline Flow

Input Problem (s)
Noise Initialization (z_T)
Iterative Denoising (Reverse Diffusion Process)
Final Output (Rationales r + Answer a)
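
The four stages above can be illustrated with a toy loop. This is a minimal sketch, not the paper's sampler: `toy_denoiser`, `TOY_TARGETS`, the `_` noise symbol, and the `#` separator between rationale and answer are all hypothetical stand-ins, and the re-noising rule is a simple masking schedule chosen only to show that every position can still change until the last step.

```python
import random

MASK = "_"  # placeholder for a still-noisy position (illustrative only)

# Hypothetical lookup standing in for what a fine-tuned backbone has learned.
TOY_TARGETS = {"2*3": "<2*3=6>#6"}  # rationale, then '#', then the answer


def toy_denoiser(problem, z_t):
    """Stand-in for the diffusion backbone: predicts the clean tokens.

    A real backbone would use z_t and the timestep; this toy ignores them.
    """
    return list(TOY_TARGETS[problem])


def reverse_diffusion(problem, length, T=8, seed=0):
    rng = random.Random(seed)
    z = [MASK] * length                      # z_T: every position is noise
    for t in range(T, 0, -1):                # t = T, T-1, ..., 1
        x0_hat = toy_denoiser(problem, z)    # predict the clean sequence
        noisy_frac = (t - 1) / T             # share of positions re-noised
        # All positions are re-drawn each step, so any token may still change.
        z = [MASK if rng.random() < noisy_frac else tok for tok in x0_hat]
    rationale, answer = "".join(z).split("#")
    return rationale, answer
```

With these toy definitions, `reverse_diffusion("2*3", length=9)` returns the rationale `<2*3=6>` and the answer `6`, because no position is re-noised at the final step (t = 1).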

System Modules

Diffusion Backbone

Predicts the denoised token distribution or score given noisy latent z_t and timestep t

Model or implementation: Plaid (1.3B) or SEDD (Small/Medium)

ODE Solver (Conditional)

Accelerates sampling for continuous diffusion models by solving the probability flow ODE

Model or implementation: Adapted DPM-Solver

Novel Architectural Elements

Integration of CoT rationales directly into the diffusion latent space, allowing 'vertical' reasoning depth (timesteps) alongside 'horizontal' reasoning breadth (tokens)
Multi-Pass (DoTMP) architecture that chains multiple diffusion processes for sequential thought generation
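
The multi-pass idea can be sketched as an outer loop in which each iteration stands for one complete diffusion pass conditioned on the problem and on the thoughts produced so far. The helper `denoise_one_thought`, the scripted outputs, and the `####` answer marker are assumptions made for illustration; they are not taken from the paper's code.

```python
# Scripted outputs standing in for a trained model (hypothetical).
SCRIPT = ["<4*2=8>", "<8+3=11>", "#### 11"]


def denoise_one_thought(problem, previous_thoughts):
    """Stand-in for one full reverse-diffusion pass that yields the next thought."""
    return SCRIPT[len(previous_thoughts)]


def dot_multi_pass(problem, max_passes=8):
    thoughts = []
    for _ in range(max_passes):
        # Each pass conditions on the problem plus all earlier thoughts.
        nxt = denoise_one_thought(problem, thoughts)
        if nxt.startswith("####"):           # assumed marker for the final answer
            return thoughts, nxt[4:].strip()
        thoughts.append(nxt)
    return thoughts, None                     # no answer within the pass budget
```

The loop restores a left-to-right order across thoughts, while each individual thought is still produced by a diffusion process.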

Modeling

Base Model: Plaid (1.3B), SEDD-small (170M), SEDD-medium (424M)

Training Method: Fine-tuning with diffusion loss (Variational Lower Bound)

Objective Functions:

Purpose: Minimize the negative variational lower bound of the data log-likelihood.

Formally: L_VLB = E[Prior Loss + Diffusion Loss + Rounding Loss]
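
For readers who want the three terms spelled out, one common discrete-time form for continuous text diffusion (in the style of Diffusion-LM) is sketched below. The notation is an assumption rather than the paper's: x denotes the token sequence (rationales plus answer), z_0 its embedding, z_t the noised latent at step t, and conditioning on the problem s is omitted for brevity. The grouping of the last two terms under "rounding" is also an interpretive choice.

```latex
\mathcal{L}_{\mathrm{VLB}}
= \mathbb{E}_{q}\Big[
    \underbrace{D_{\mathrm{KL}}\big(q(z_T \mid z_0)\,\|\,p(z_T)\big)}_{\text{prior loss}}
  + \underbrace{\sum_{t=2}^{T} D_{\mathrm{KL}}\big(q(z_{t-1} \mid z_t, z_0)\,\|\,p_\theta(z_{t-1} \mid z_t)\big)}_{\text{diffusion loss}}
  + \underbrace{\log \frac{q(z_0 \mid x)}{p_\theta(z_0 \mid z_1)} - \log p_\theta(x \mid z_0)}_{\text{rounding loss}}
\Big]
```

Discrete models such as SEDD are trained with a different (score-entropy) objective, so this expansion applies only to the continuous case.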

Adaptation: Full fine-tuning

Training Data:

Augmented GSM8K dataset (training set)
BIG-bench multiplication (4x4, 5x5)
Boolean logic reasoning tasks

Key Hyperparameters:

scheduled_sampling_min_epsilon: 0.95
coupled_sampling_gamma: 0.01
coupled_sampling_k: 1
self_consistency_samples: 20
inference_timesteps_T: Dynamic (default 64, tested 1-256)
classifier_free_guidance: Used (DiffuSeq-style)

Compute: 8 NVIDIA V100-32G GPUs for experiments

Comparison to Prior Work

vs. CoT (Autoregressive): DoT allows global planning and revising of thoughts via diffusion steps, rather than rigid left-to-right generation.
vs. Implicit CoT: DoT generates explicit thoughts in the diffusion process, allowing interpretability and self-correction, whereas Implicit CoT hides them in activations.
vs. DiffuSeq: DoT adds training techniques aimed at self-correction (scheduled sampling and coupled sampling) and is applied to reasoning tasks.

Limitations

Performance heavily depends on the base pre-trained diffusion model quality; cannot yet compete with proprietary LLMs (GPT-4).
Continuous diffusion models (Plaid) require careful tuning of ODE solvers for efficiency.
Requires fine-tuning on reasoning datasets; zero-shot capabilities not explored extensively compared to large AR models.

Reproducibility

Code: https://github.com/HKUNLP/diffusion-of-thoughts

The pre-trained base models (Plaid, SEDD) are open-source, and the augmented datasets are taken from the Implicit CoT paper.

📊 Experiments & Results

Evaluation Setup

Fine-tuning on task-specific training data, evaluating on test sets. Comparison against fine-tuned GPT-2 variants.

Benchmarks:

GSM8K (Augmented) (Grade School Math)
BIG-bench Multiplication (Arithmetic, 4x4 and 5x5 digits)
Boolean Logic (Logical Reasoning)

Metrics:

Exact Match Accuracy
Throughput (samples/second)
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance on Grade School Math (GSM8K) shows diffusion models outperforming similarly sized autoregressive baselines.
Benchmark	Metric	Baseline	This Paper	Δ
GSM8K (Augmented)	Exact Match Accuracy	44.8	53.5	+8.7
GSM8K (Augmented)	Exact Match Accuracy	43.9	53.5	+9.6
GSM8K (Augmented)	Exact Match Accuracy	32.6	36.3	+3.7

Efficiency results on simple tasks demonstrate significant speedups.
Benchmark	Metric	Baseline	This Paper	Δ
4x4 Multiplication	Throughput (it/sec)	2.3	62.5	+60.2
GSM8K (Augmented)	Accuracy	31.2	32.6	+1.4
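
The Δ column and the headline ratios quoted in this summary can be re-derived from the reported numbers. The short check below uses only values that appear in this summary; the helper names `delta` and `ratio` are illustrative.

```python
def delta(baseline, ours):
    """Absolute difference, rounded to one decimal as in the table."""
    return round(ours - baseline, 1)


def ratio(baseline, ours):
    """Multiplicative factor, rounded to one decimal."""
    return round(ours / baseline, 1)


print(delta(44.8, 53.5))   # accuracy gain in percentage points -> 8.7
print(ratio(2.3, 62.5))    # throughput speed-up factor -> 27.2
print(ratio(424, 774))     # parameter-count ratio, GPT-2 Large vs SEDD-medium -> 1.8
```

Note that the accuracy deltas are differences in percentage points, while the throughput comparison is more naturally read as a ratio (about 27x) than as the absolute difference of +60.2.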

Experiment Figures

Effectiveness of ODE Solver on Plaid DoT inference speed.

Accuracy vs. Reasoning Steps (Compute) trade-off.

Main Takeaways

DoT allows trading off compute for accuracy: allocating more diffusion timesteps consistently improves performance on hard tasks (GSM8K).
DoT exhibits self-correction: qualitative examples show the model correcting intermediate erroneous thoughts (e.g., <2*3=4> -> <2*3=6>) in later diffusion steps.
Self-consistency works well with DoT, leveraging the inherent stochasticity of the diffusion process to generate diverse reasoning paths.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Probabilistic Models (Forward/Reverse process)
Chain-of-Thought (CoT) Prompting
Autoregressive vs. Non-autoregressive generation

Key Terms

DoT: Diffusion-of-Thought—a method integrating Chain-of-Thought reasoning into the denoising process of diffusion models.

DoTMP: Diffusion-of-Thought Multi-Pass—a variant where the model generates one thought per diffusion process, using previous thoughts as conditions.

Plaid: A large-scale continuous diffusion language model (1.3B parameters) trained on OpenWebText.

SEDD: Score Entropy Discrete Diffusion—a discrete diffusion language model that operates directly on token indices.

Implicit CoT: A method where reasoning steps are performed in the hidden states of a transformer rather than outputted as text tokens.

Classifier-free guidance: A technique to control diffusion generation by mixing conditional and unconditional score estimates, used here to condition on the problem statement.

Self-consistency: A decoding strategy that samples multiple reasoning paths and selects the most frequent final answer.
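
A minimal sketch of the voting step, assuming each sampled reasoning path has already been reduced to a final-answer string (or `None` when no answer could be parsed). The function name and the handling of unparsed samples are illustrative choices.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent final answer among the sampled paths."""
    votes = Counter(a for a in answers if a is not None)  # skip unparsed samples
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For example, `self_consistency(["11", "12", "11", None])` returns `"11"`. In the setup summarized here, the list would hold the answers from the 20 sampled diffusion runs.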

Scheduled sampling: A training technique where the model is occasionally exposed to its own generated (potentially erroneous) outputs to improve robustness.
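
One way to picture the schedule is a probability epsilon of using the ground-truth input, which starts at 1 and decays toward the `scheduled_sampling_min_epsilon` value of 0.95 listed under Key Hyperparameters. The linear decay shape and the function names below are assumptions for illustration, not details confirmed by this summary.

```python
import random


def epsilon_at(step, total_steps, eps_min=0.95):
    """Probability of using the ground-truth input at a given training step.

    Assumed schedule: linear decay from 1.0 down to eps_min.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - (1.0 - eps_min) * frac


def use_ground_truth(step, total_steps, rng, eps_min=0.95):
    """True: noise the gold sequence; False: noise the model's own prediction."""
    return rng.random() < epsilon_at(step, total_steps, eps_min)
```

With `eps_min = 0.95`, the model's own (possibly wrong) prediction is fed back at most about 5% of the time, which exposes it to its own errors without destabilizing training.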

Coupled sampling: A training strategy for DoTMP where noise is added to prior correct thoughts during training to mimic inference-time errors.
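
A rough sketch of how the choice of noised spans might look, reading `coupled_sampling_gamma` (0.01) as the probability of also noising earlier thoughts and `coupled_sampling_k` (1) as how many earlier thoughts are included. This interpretation of the two hyperparameters, and the function name, are assumptions.

```python
import random


def thoughts_to_noise(i, k=1, gamma=0.01, rng=random):
    """Indices of the thoughts that receive noise when training on thought i.

    Usually only thought i is noised; with probability gamma the previous
    k thoughts are noised as well (assumed reading of the hyperparameters).
    """
    if rng.random() < gamma:
        return list(range(max(0, i - k), i)) + [i]
    return [i]
```

Setting `gamma=1.0` always includes the previous thoughts, for example `thoughts_to_noise(3, k=1, gamma=1.0)` gives `[2, 3]`, while `gamma=0.0` always gives `[3]`.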