Large Language Diffusion Models

📝 Paper Summary

Language Modeling, Diffusion Models, Generative AI

LLaDA is an 8B-parameter diffusion model trained from scratch that matches autoregressive baselines in scalability and capabilities, demonstrating that next-token prediction is not the only path to strong language models.

Core Problem

Autoregressive models (ARMs) dominate LLMs but have inherent limitations such as the 'reversal curse' (failing to generalize from 'A is B' to 'B is A'), and it remains unproven whether core capabilities like in-context learning are unique to the autoregressive paradigm.

Why it matters:

The left-to-right generation order restricts models from handling tasks requiring bidirectional context or reversal reasoning
Establishing whether diffusion models can scale effectively offers a principled alternative for generative modeling beyond next-token prediction

Concrete Example: When asked to complete a poem in reverse or reason backwards, standard left-to-right models fail due to unidirectional dependencies. LLaDA successfully generates 'The Road Not Taken' in reverse and outperforms GPT-4o on a reversal poem completion task.

Key Novelty

LLaDA (Large Language Diffusion with mAsking)

Apply Masked Diffusion Models (MDM) at the scale of modern LLMs (8B parameters, 2.3T tokens), replacing next-token prediction with a masking-and-recovery objective
Utilize a Transformer-based mask predictor that sees the entire sequence (bidirectional attention) during both training and inference
Demonstrate that the standard LLM pipeline (Pre-training + SFT) works effectively for diffusion models, with the only architectural difference being that the Transformer does not use a causal attention mask

Architecture

The training and inference pipeline of LLaDA. (a) Pre-training via forward masking and reverse prediction. (b) SFT masking only responses. (c) Inference via iterative sampling.
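Step (b) of the caption, masking only the response during SFT, can be sketched as follows. This is a minimal illustration and not the authors' code: `MASK_ID`, `mask_response_only`, and the token ids are hypothetical names and values chosen for the example, and the masking probability `t` is passed in by hand.

```python
import numpy as np

MASK_ID = -1  # hypothetical id standing in for the mask token

def mask_response_only(prompt_ids, response_ids, t, rng):
    """Mask each response token independently with probability t; leave the prompt intact."""
    prompt = np.asarray(prompt_ids)
    response = np.asarray(response_ids)
    is_masked = rng.random(response.shape[0]) < t
    noisy_response = np.where(is_masked, MASK_ID, response)
    noisy = np.concatenate([prompt, noisy_response])
    # The loss is taken only where a response token was masked, never on the prompt.
    loss_positions = np.concatenate([np.zeros(prompt.shape[0], dtype=bool), is_masked])
    return noisy, loss_positions

rng = np.random.default_rng(0)
noisy, loss_positions = mask_response_only([11, 12, 13], [21, 22, 23, 24], t=0.5, rng=rng)
```

The prompt tokens pass through unchanged, so the mask predictor always conditions on the full instruction while learning to recover the response.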

Evaluation Highlights

LLaDA 8B Base achieves 70.3% on GSM8K (4-shot), outperforming LLaMA3 8B (48.7%) and LLaMA2 7B (13.1%)
Matches LLaMA3 8B Base on MMLU (5-shot) with 65.9% vs 65.4%, demonstrating competitive knowledge capability
Scales effectively to 10^23 FLOPs, showing performance trends comparable to autoregressive baselines on tasks like MMLU and GSM8K

Breakthrough Assessment

9/10

Challenges the fundamental dominance of autoregressive modeling. It shows that diffusion models can scale to 8B parameters and match strong ARM baselines on standard benchmarks, offering a viable alternative paradigm.

⚙️ Technical Details

Problem Definition

Setting: Generative modeling of discrete text sequences via maximizing a variational lower bound of the log-likelihood

Inputs: Masked text sequence x_t, obtained from the clean sequence x_0 by masking each token independently with probability t, where t lies between 0 and 1

Outputs: Predicted original tokens for all masked positions simultaneously

Pipeline Flow

Forward Process (Data Masking)
Mask Predictor (Transformer)
Reverse Process (Iterative Denoising/Generation)

System Modules

Forward Process

Gradually corrupts data by masking tokens

Model or implementation: Stochastic process
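A minimal sketch of the forward masking step, assuming the masking probability `t` is drawn uniformly at random for each sequence (the summary only says the ratio varies randomly between 0 and 1). `MASK_ID` and `forward_mask` are hypothetical names for illustration.

```python
import numpy as np

MASK_ID = -1  # hypothetical id standing in for the mask token

def forward_mask(x0, rng):
    """Corrupt a clean token sequence x0 into x_t by independent random masking."""
    x0 = np.asarray(x0)
    t = rng.uniform(0.0, 1.0)                   # assumed: t ~ Uniform[0, 1)
    is_masked = rng.random(x0.shape[0]) < t     # each token masked with probability t
    xt = np.where(is_masked, MASK_ID, x0)
    return xt, is_masked, t

rng = np.random.default_rng(0)
x0 = [5, 6, 7, 8, 9, 10]
xt, is_masked, t = forward_mask(x0, rng)
```

Because `t` changes from sequence to sequence, the model sees everything from lightly corrupted to almost fully masked inputs during training.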

Mask Predictor

Predicts the original identity of masked tokens given the context

Model or implementation: Transformer (8B parameters, bidirectional attention)
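To make "bidirectional attention" concrete, the toy example below contrasts attention weights with and without a causal mask. It is a generic illustration of the two masking schemes, not LLaDA's implementation; `attention_weights` is a hypothetical helper.

```python
import numpy as np

def attention_weights(scores, causal):
    """Row-wise softmax over attention scores, optionally blocking future positions."""
    scores = np.array(scores, dtype=float)      # copy so the caller's array is untouched
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores[future] = -np.inf                # position i cannot attend to j > i
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                       # equal scores make the mask effect visible
causal_w = attention_weights(scores, causal=True)
bidir_w = attention_weights(scores, causal=False)
```

With the causal mask, the first position attends only to itself; without it, every position spreads attention over the whole sequence, which is what lets a mask predictor use context on both sides of a masked token.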

Reverse Process (Sampler)

Generates text by iteratively predicting and re-masking tokens

Model or implementation: Iterative algorithm (Low-confidence remasking)
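The iterative sampler with low-confidence remasking can be sketched as below. Several details are assumptions made for illustration: the linear schedule for how many tokens stay masked after each step, greedy (argmax) prediction, and the `toy_predictor` function, which is a hypothetical stand-in for the trained mask predictor.

```python
import numpy as np

MASK_ID = -1  # hypothetical id standing in for the mask token

def sample(predict_probs, length, steps):
    """Start fully masked; at each step predict all masked tokens, then remask the least confident."""
    x = np.full(length, MASK_ID)
    for step in range(1, steps + 1):
        masked = x == MASK_ID
        if not masked.any():
            break
        probs = predict_probs(x)                # shape (length, vocab)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        candidate = np.where(masked, pred, x)   # fill every masked position
        # Assumed linear schedule: fraction still masked after this step is 1 - step/steps.
        n_keep_masked = int(round(length * (1 - step / steps)))
        if n_keep_masked > 0:
            # Tokens committed in earlier steps get infinite confidence so they are never remasked.
            conf = np.where(masked, conf, np.inf)
            remask = np.argsort(conf)[:n_keep_masked]
            candidate[remask] = MASK_ID
        x = candidate
    return x

def toy_predictor(x, target=(3, 1, 4, 1, 5, 9), vocab=10):
    """Hypothetical predictor that always favors `target`, more confidently at later positions."""
    probs = np.zeros((len(x), vocab))
    for i in range(len(x)):
        p = 0.5 + 0.05 * i
        probs[i] = (1 - p) / (vocab - 1)
        probs[i, target[i]] = p
    return probs

out = sample(toy_predictor, length=6, steps=3)
```

In this toy run the most confident positions (the last two) are committed first and the least confident ones are filled in at the final step, so the generation order follows confidence rather than left-to-right position.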

Novel Architectural Elements

Application of non-causal (bidirectional) Transformers for generative language modeling at 8B scale
Unified masking strategy where masking ratio varies randomly between 0 and 1 during training (unlike BERT's fixed ratio)

Modeling

Base Model: LLaDA 8B (Custom Transformer architecture)

Training Method: Pre-training from scratch followed by Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Maximize log-likelihood lower bound.

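Formally: Cross-entropy loss computed only on masked tokens and weighted by 1/t: L(θ) = -E_{t, x_0, x_t}[ (1/t) Σ_i 1[x_t^i = M] log p_θ(x_0^i | x_t) ], where M is the mask token, t is the masking probability, and the indicator 1[x_t^i = M] selects the masked positions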
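A minimal single-sequence sketch of this objective, including the 1/t weight used in the LLaDA loss. The function name and the toy inputs are hypothetical; a real training loop would average this estimate over a batch and may additionally normalize by sequence length.

```python
import numpy as np

def masked_diffusion_loss(log_probs, x0, is_masked, t):
    """One-sample estimate: -(1/t) * sum over masked positions of log p(x0_i | x_t)."""
    x0 = np.asarray(x0)
    # Log-probability the model assigns to the true token at every position.
    token_log_probs = log_probs[np.arange(x0.shape[0]), x0]
    # Unmasked positions contribute nothing to the loss.
    return -(token_log_probs * is_masked).sum() / t

vocab = 4
log_probs = np.log(np.full((3, vocab), 1.0 / vocab))   # a predictor that is uniform over the vocabulary
loss = masked_diffusion_loss(
    log_probs, x0=[0, 2, 3], is_masked=np.array([True, False, True]), t=0.5
)
```

Here two of the three tokens are masked and each costs log 4 under the uniform predictor, so the estimate is 2 · log 4 / 0.5 = 4 · log 4.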

Adaptation: Full parameter tuning during SFT

Trainable Parameters: 8 Billion

Training Data:

Pre-training: 2.3 Trillion tokens (code, math, multilingual, general text)
SFT: 4.5 Million instruction-response pairs

Key Hyperparameters:

learning_rate_pretrain: 4e-4 (decaying to 1e-5)
learning_rate_sft: 2.5e-5 (decaying to 2.5e-6)
batch_size_pretrain: 1280 (global)
sequence_length: 4096
weight_decay: 0.1
optimizer: AdamW

Compute: 0.13 million H800 GPU hours for pre-training
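The figures above can be cross-checked with some back-of-the-envelope arithmetic. The sketch assumes every sequence in a batch is a full 4096 tokens and uses the common 6 · N · D rule of thumb for dense Transformer training FLOPs; neither assumption comes from the summary, so the outputs are rough estimates only.

```python
params = 8e9                 # model parameters (N)
tokens = 2.3e12              # pre-training tokens (D)
batch_size = 1280            # global batch size, in sequences
seq_len = 4096               # tokens per sequence (assumed always full length)
gpu_hours = 0.13e6           # reported H800 GPU hours

tokens_per_step = batch_size * seq_len            # 5,242,880 tokens
approx_steps = tokens / tokens_per_step           # roughly 4.4e5 optimizer steps
approx_train_flops = 6 * params * tokens          # roughly 1.1e23 FLOPs (rule of thumb)
implied_flops_per_gpu_sec = approx_train_flops / (gpu_hours * 3600)
```

The estimated total of about 1.1e23 FLOPs is consistent in order of magnitude with the 10^23 FLOPs scale quoted in the evaluation highlights.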

Reproducibility

Code: https://ml-gsai.github.io/LLaDA-demo/

Code and model weights are publicly available (https://ml-gsai.github.io/LLaDA-demo/). The pre-training and SFT datasets are described, but the exact proprietary splits might not be released.

📊 Experiments & Results

Evaluation Setup

Zero/Few-shot evaluation on standard benchmarks and scalability analysis

Benchmarks:

MMLU (General Knowledge Understanding)
GSM8K (Grade School Math)
HumanEval (Code Generation)
Reversal Poem Completion (Reversal Reasoning)

Metrics:

Accuracy (%)
Pass@1
Statistical methodology: Not explicitly reported in the paper
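For reference, Pass@1 is usually computed with the unbiased pass@k estimator introduced alongside HumanEval. The summary does not say how many samples per problem were drawn, so the numbers in the example are purely illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n generated samples, c of which pass the unit tests."""
    if n - c < k:
        return 1.0            # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative values: 10 samples for one problem, 3 of them correct.
score = pass_at_k(n=10, c=3, k=1)
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n; the benchmark score is the average of this quantity over all problems.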

Key Results

Pre-trained Base models: zero/few-shot performance comparisons showing LLaDA 8B's competitiveness with Autoregressive (ARM) models.

Benchmark	Metric	Baseline	This Paper	Δ
MMLU	Accuracy (5-shot)	45.9	65.9	+20.0
GSM8K	Accuracy (4-shot)	48.7	70.3	+21.6
HumanEval	Pass@1 (0-shot)	34.8	35.4	+0.6
CMMLU	Accuracy (5-shot)	50.7	69.9	+19.2

After Supervised Fine-Tuning (SFT): results showing instruction-following capabilities.

Benchmark	Metric	Baseline	This Paper	Δ
MMLU	Accuracy (5-shot)	68.4	65.5	-2.9
GSM8K	Accuracy (4-shot)	78.3	69.4	-8.9

Experiment Figures

Scaling laws: Performance (y-axis) vs Compute FLOPs (x-axis) for LLaDA vs ARM baselines on 6 tasks.

Main Takeaways

Scalability is not unique to autoregressive models; LLaDA's performance improves with compute budget along trends comparable to ARM baselines, up to 10^23 FLOPs.
In-context learning and instruction-following emerge in diffusion models given sufficient scale and data, challenging the view that these are ARM-specific traits.
Diffusion models inherently handle bidirectional dependencies, allowing them to solve reversal tasks (like reverse poem completion) where ARMs fail.
LLaDA demonstrates that the generative modeling principle (optimizing likelihood lower bound), rather than the specific autoregressive formulation, underpins LLM capabilities.

📚 Prerequisite Knowledge

Prerequisites

Basics of Diffusion Models (Forward/Reverse processes)
Transformer architecture (Attention mechanisms)
Autoregressive Language Modeling (Next-token prediction)

Key Terms

MDM: Masked Diffusion Model—a generative model that adds noise by masking tokens and learns to reconstruct them

ARM: Autoregressive Model—standard language models that generate text one token at a time from left to right

SFT: Supervised Fine-Tuning—training a pre-trained model on labeled instruction-response pairs

Reversal Curse: The inability of autoregressive LLMs to generalize from 'A is B' to 'B is A' or generate text backwards due to their unidirectional training

FLOPs: Floating Point Operations—a measure of computational cost used here to analyze scaling laws

MMLU: Massive Multitask Language Understanding—a benchmark evaluating models on a wide range of subjects

GSM8K: Grade School Math 8K—a benchmark of high-quality grade school math word problems

In-context learning: The ability of a model to perform tasks based on examples provided in the prompt without parameter updates

KL divergence: Kullback-Leibler divergence—a statistical distance measure used in the loss function to align the model distribution with the data distribution
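The link between KL divergence and the cross-entropy training loss can be checked numerically: KL(p || q) equals the cross-entropy of q under p minus the entropy of p, and the entropy of the data distribution does not depend on the model. The two distributions below are arbitrary illustrative values.

```python
import numpy as np

def cross_entropy(p, q):
    """Expected negative log-probability of q under samples from p."""
    return -np.sum(p * np.log(q))

def entropy(p):
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])    # stands in for the data distribution
q = np.array([0.5, 0.3, 0.2])    # stands in for the model distribution

kl = kl_divergence(p, q)
gap = cross_entropy(p, q) - entropy(p)
```

Since `entropy(p)` is fixed by the data, lowering the cross-entropy lowers the KL divergence by the same amount, which is why likelihood-based training aligns the model distribution with the data distribution.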