DoT: Diffusion-of-Thought—a method integrating Chain-of-Thought reasoning into the denoising process of diffusion models.
DoTMP: Diffusion-of-Thought Multi-Pass—a sequential variant of DoT that generates one thought per diffusion pass, conditioning each pass on the previously generated thoughts.
Plaid: A large-scale continuous diffusion language model (1.3B parameters) trained on OpenWebText.
SEDD: Score Entropy Discrete Diffusion—a discrete diffusion language model that operates directly on token indices.
Implicit CoT: A method in which reasoning steps are carried out in the hidden states of a transformer rather than emitted as explicit text tokens.
Classifier-free guidance: A technique to control diffusion generation by mixing conditional and unconditional score estimates, used here to condition on the problem statement.
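The mixing rule behind classifier-free guidance can be sketched in a few lines. This is a generic illustration of the standard formulation, not the paper's exact implementation; the function name and the convention that `w = 1` recovers the purely conditional estimate are assumptions.

```python
import numpy as np

def guided_score(score_cond, score_uncond, w):
    """Classifier-free guidance mixing rule (generic sketch).

    Interpolates/extrapolates between the unconditional and conditional
    score estimates: w = 0 gives the unconditional score, w = 1 the
    conditional one, and w > 1 strengthens conditioning on the problem
    statement.
    """
    return score_uncond + w * (score_cond - score_uncond)
```

With `w > 1`, the sample is pushed further toward outputs consistent with the conditioning signal than the conditional model alone would go.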
Self-consistency: A decoding strategy that samples multiple reasoning paths and selects the most frequent final answer.
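Self-consistency reduces to sampling several reasoning paths and taking a majority vote over their final answers. A minimal sketch, where `sample_fn` is a hypothetical sampler returning a `(reasoning, answer)` pair:

```python
from collections import Counter

def self_consistency(sample_fn, n_paths=10):
    """Sample n_paths reasoning paths and return the most frequent
    final answer (majority vote). sample_fn is assumed to return a
    (reasoning, answer) tuple per call."""
    answers = [sample_fn()[1] for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]
```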
Scheduled sampling: A training technique where the model is occasionally exposed to its own generated (potentially erroneous) outputs to improve robustness.
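A generic sketch of the scheduled-sampling idea (token-level mixing; the function names and the per-token replacement probability `p_model` are illustrative assumptions, not the paper's exact schedule):

```python
import random

def scheduled_context(gold_tokens, model_sample_fn, p_model):
    """Build a training context that mixes ground-truth tokens with
    model-generated ones. With probability p_model each gold token is
    replaced by a model sample, exposing the model to its own
    (potentially erroneous) outputs during training."""
    return [model_sample_fn(t) if random.random() < p_model else t
            for t in gold_tokens]
```

In practice `p_model` is typically annealed upward over training, so early training sees mostly gold context and later training sees mostly model-generated context.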
Coupled sampling: A training strategy for DoTMP where noise is added to prior correct thoughts during training to mimic inference-time errors.
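In a continuous diffusion setting, corrupting prior thoughts can be sketched as adding noise to their embeddings before conditioning on them. This is a loose illustration of the idea only; the function name, Gaussian noise form, and `noise_level` parameter are assumptions, not the paper's exact procedure.

```python
import numpy as np

def corrupt_prior_thoughts(thought_embs, noise_level, rng):
    """Add Gaussian noise to the embeddings of previously generated
    (correct) thoughts during training, so conditioning at train time
    mimics the imperfect thoughts seen at inference time."""
    return thought_embs + noise_level * rng.standard_normal(thought_embs.shape)
```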