LLaDA2.0: Scaling Up Diffusion Language Models to 100B

📝 Paper Summary

Discrete Masked Diffusion Language Models (MDLM)
Block Diffusion Language Models (BDLM)
Large Language Model Pre-training

LLaDA2.0 scales diffusion language models to 100 billion parameters by converting pre-trained auto-regressive models via a three-phase Warmup-Stable-Decay strategy that balances knowledge inheritance with bidirectional diffusion capabilities.

Core Problem

Training large-scale diffusion language models from scratch is prohibitively expensive, but direct conversion from standard auto-regressive models fails due to the distribution gap between left-to-right generation and bidirectional denoising.

Why it matters:

Auto-regressive models suffer from sequential inference bottlenecks, preventing parallel generation and increasing latency at scale
Existing diffusion models are limited to small scales (≤8B) due to training costs, failing to match the performance of 100B+ frontier models
Bridging the gap allows diffusion models to inherit the vast knowledge of existing pre-trained AR models while enabling fast parallel decoding

Concrete Example: When a diffusion model is initialized from an AR model and trained directly, the mismatch between the AR model's causal attention and the diffusion model's need for bidirectional context leads to unstable optimization and catastrophic forgetting of linguistic knowledge. LLaDA2.0 avoids this by growing the block size gradually, so the model moves from token-by-token prediction to full-sequence denoising in small steps.

Key Novelty

Warmup-Stable-Decay (WSD) Continual Pre-training

Warmup: Gradually increases the block size in Block Diffusion from small spans to the full sequence, allowing the AR model to slowly adapt to bidirectional diffusion context
Stable: Trains on full-sequence masked diffusion (MDLM) at large scale to solidify global denoising capabilities
Decay: Reverts to a compact block size to distill global knowledge back into a blockwise structure optimized for efficient KV-cache reuse during inference
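The three phases can be read as a schedule over the block size. The sketch below is illustrative only: the endpoints (1, 4096, 32) are the ones this summary reports under Key Hyperparameters, while the power-of-two progression, the function name `wsd_block_size`, and the per-phase step budgets are assumptions, not details from the paper.

```python
def wsd_block_size(step, warmup_steps, stable_steps, decay_steps,
                   min_block=1, max_block=4096, final_block=32):
    """Return the block size used at a training step (hypothetical schedule).

    Assumes max_block/min_block and max_block/final_block are powers of two.
    """
    if step < 0:
        raise ValueError("step must be non-negative")

    if step < warmup_steps:
        # Warmup: double the block size in evenly spaced stages (assumed shape).
        doublings = (max_block // min_block).bit_length() - 1
        stage = (step * (doublings + 1)) // warmup_steps
        return min_block * 2 ** stage

    if step < warmup_steps + stable_steps:
        # Stable: one block spans the whole sequence, i.e. plain MDLM training.
        return max_block

    # Decay: halve the block size in evenly spaced stages down to final_block.
    d = step - warmup_steps - stable_steps
    if d >= decay_steps:
        return final_block
    halvings = (max_block // final_block).bit_length() - 1
    stage = 1 + (d * halvings) // decay_steps
    return max_block // 2 ** stage
```

With a block size of 1 the objective coincides with next-token prediction, which is why the schedule can start from an AR checkpoint without an abrupt change of task.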

Architecture

The holistic training pipeline of LLaDA2.0, showing the transition from AR to Diffusion.

Evaluation Highlights

LLaDA2.0-flash (100B) enables parallel decoding, surpassing the inference speed of equivalently sized auto-regressive models
Successfully scales diffusion language models to 100B parameters (LLaDA2.0-flash) and 16B parameters (LLaDA2.0-mini) via continual pre-training
Achieves competitive performance on standard benchmarks by inheriting knowledge from strong AR base models (Ling-mini-2.0 and Ling-flash-2.0)

Breakthrough Assessment

8/10

First successful scaling of diffusion language models to the 100B parameter frontier. The WSD strategy offers a practical recipe for converting existing AR models to diffusion, potentially shifting the paradigm for efficient large-scale inference.

⚙️ Technical Details

Problem Definition

Setting: Generative Language Modeling via Discrete Diffusion

Inputs: Masked sequence tokens x_t (corrupted from x_0)

Outputs: Predicted original tokens for the masked positions
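As a rough illustration of how x_t is obtained from x_0, the sketch below masks each token independently with probability t. This is the usual absorbing-state corruption in masked diffusion; the summary does not spell out the exact noise process, so the function name `corrupt`, the `[MASK]` string, and the independence assumption are illustrative.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol, not the model's real mask id


def corrupt(x0, t, rng):
    """Replace each token of x0 by MASK_TOKEN with probability t (0 <= t <= 1)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [MASK_TOKEN if rng.random() < t else tok for tok in x0]


# Example: about half of the tokens are hidden at t = 0.5.
x_t = corrupt(["the", "cat", "sat", "down"], 0.5, random.Random(0))
```

The model then sees `x_t` and is trained to predict the original token at every masked position.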

Pipeline Flow

Input Processing (Masking)
Block-wise Diffusion Denoising
Iterative Refinement (Inference)
Block Auto-regression (if BDLM mode)
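The flow above can be sketched as a toy decoding loop: blocks are produced left to right, and inside each block every masked position whose predicted confidence clears a threshold is filled in the same step. The `predict` callable is a hypothetical stand-in for the model, and the threshold rule with a one-token fallback is an assumed decoding policy, not the paper's documented algorithm.

```python
MASK = None  # marker for a not-yet-generated position


def decode_block(tokens, start, end, predict, threshold):
    """Fill every MASK in tokens[start:end] by iterative refinement."""
    while True:
        masked = [i for i in range(start, end) if tokens[i] is MASK]
        if not masked:
            return tokens
        # predict returns {position: (token, confidence)} for masked positions.
        proposals = predict(tokens, masked)
        accepted = [i for i in masked if proposals[i][1] >= threshold]
        if not accepted:
            # Always commit the single most confident token so the loop ends.
            accepted = [max(masked, key=lambda i: proposals[i][1])]
        for i in accepted:
            tokens[i] = proposals[i][0]


def generate(prompt, num_new_tokens, block_size, predict, threshold=0.9):
    """Generate blocks auto-regressively, denoising in parallel inside each."""
    tokens = list(prompt) + [MASK] * num_new_tokens
    for start in range(len(prompt), len(tokens), block_size):
        end = min(start + block_size, len(tokens))
        decode_block(tokens, start, end, predict, threshold)
    return tokens


def toy_predict(tokens, masked):
    """Toy model: confident on even positions, unsure on odd ones."""
    return {i: (f"t{i}", 1.0 if i % 2 == 0 else 0.5) for i in masked}


result = generate(["a"], 4, 2, toy_predict)
# result == ['a', 't1', 't2', 't3', 't4']
```

Because a finished block never changes again, its keys and values can be cached for all later blocks, which is what makes the blockwise form attractive at inference time.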

System Modules

Backbone Transformer

Predict original tokens for masked positions based on context

Model or implementation: Based on Ling-mini-2.0 (16B MoE) and Ling-flash-2.0 (100B MoE)

Confidence Predictor

Auxiliary head to predict generation confidence

Model or implementation: Auxiliary loss head

Novel Architectural Elements

Warmup-Stable-Decay (WSD) training scheduler involving dynamic block sizing to bridge AR and Diffusion architectures
Block-wise document-level attention mask to prevent cross-document contamination during bidirectional training
Confidence-aware parallel decoding mechanism enabled by auxiliary confidence loss
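A minimal sketch of a block-wise, document-level attention mask follows. It assumes blocks restart at each document boundary and shows only the simplified single-sequence view (bidirectional inside a block, causal across blocks, no attention across documents); the mask actually used in training may differ, and the function name is hypothetical.

```python
def blockwise_document_mask(doc_lengths, block_size):
    """Return mask[q][k] == True when query position q may attend to key k.

    doc_lengths: lengths of the documents packed into one training sequence.
    """
    if block_size < 1:
        raise ValueError("block_size must be positive")
    doc_ids, block_ids = [], []
    for doc, length in enumerate(doc_lengths):
        for offset in range(length):
            doc_ids.append(doc)
            # Assumption: block indices restart at the start of each document.
            block_ids.append(offset // block_size)
    n = len(doc_ids)
    return [
        [doc_ids[q] == doc_ids[k] and block_ids[k] <= block_ids[q]
         for k in range(n)]
        for q in range(n)
    ]


# Two packed documents of lengths 3 and 2, block size 2.
mask = blockwise_document_mask([3, 2], 2)
```

Setting `block_size=1` recovers a causal mask within each document, and a block size at least as long as the document gives fully bidirectional attention within it, which mirrors the two ends of the WSD schedule.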

Modeling

Base Model: Ling-mini-2.0 (16B MoE) and Ling-flash-2.0 (100B MoE)

Training Method: Continual Pre-training (CPT) followed by SFT and DPO

Objective Functions:

Purpose: Reconstruct masked tokens within blocks.

Formally: Cross-entropy loss on masked tokens, weighted by diffusion time-step.

Purpose: Align model with human preferences.

Formally: DPO adapted for reconstruction loss.

Purpose: Enhance model certainty for parallel decoding.

Formally: Auxiliary confidence prediction loss.
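To make the first objective concrete, here is a small sketch of a time-step-weighted masked cross-entropy. The summary only says the loss is "weighted by diffusion time-step"; the 1/t weight and the normalization by sequence length used below are a common choice in masked diffusion and are an assumption here, as is the function name.

```python
import math


def masked_diffusion_loss(true_token_probs, is_masked, t):
    """Cross-entropy over masked positions, scaled by 1/t (assumed weighting).

    true_token_probs: model probability of the correct token at each position.
    is_masked:        True where the input token was masked at noise level t.
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    if len(true_token_probs) != len(is_masked):
        raise ValueError("inputs must have the same length")
    # Unmasked positions contribute nothing to the loss.
    nll = sum(-math.log(p) for p, m in zip(true_token_probs, is_masked) if m)
    return nll / (t * len(true_token_probs))


loss = masked_diffusion_loss([0.5, 0.9, 0.25], [True, False, True], t=0.5)
```

The 1/t factor compensates for the fact that fewer tokens are masked, and therefore supervised, at low noise levels.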

Adaptation: Full model update (continual pre-training)

Training Data:

Packed heterogeneous documents with document-level masking

Key Hyperparameters:

block_size_schedule: Increases from 1 to 4096 (Warmup), stays at 4096 (Stable), decreases to 32 (Decay)
merging_strategy: Top-k checkpoint averaging
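Top-k checkpoint averaging can be sketched as follows. The summary does not say how checkpoints are scored or weighted, so this version assumes a higher validation score is better and uses a uniform average; parameters are plain lists of floats to keep the example dependency-free, and the function name is hypothetical.

```python
def average_top_k(checkpoints, k):
    """Average the parameters of the k best-scoring checkpoints.

    checkpoints: list of (score, params), where params maps a parameter
                 name to a list of floats. Higher score is assumed better.
    """
    if not 1 <= k <= len(checkpoints):
        raise ValueError("k must be between 1 and the number of checkpoints")
    best = sorted(checkpoints, key=lambda c: c[0], reverse=True)[:k]
    names = best[0][1].keys()
    return {
        name: [sum(vals) / k for vals in zip(*(params[name] for _, params in best))]
        for name in names
    }


ckpts = [
    (0.7, {"w": [1.0, 2.0]}),
    (0.9, {"w": [3.0, 4.0]}),
    (0.1, {"w": [100.0, 100.0]}),  # poor checkpoint, excluded when k = 2
]
merged = average_top_k(ckpts, 2)
```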

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaDA/LLaDA-MoE: LLaDA2.0 scales to 100B via initialization from AR rather than scratch training
vs. DiffusionLLaMA: Uses WSD (block size manipulation) instead of mask annealing or CART loss reweighting
vs. SDAR: Scales to a significantly larger model (100B vs ~30B) and introduces the Stable phase (full MDLM training) between block size shifts

Limitations

Inference speed benefits rely on parallel decoding which can be quality-dependent
Requires a high-quality pre-trained AR model as a starting point
Specifics of the dataset used for continual pre-training are not detailed
No statistical significance tests reported for the performance comparisons

Reproducibility

Code: https://hf.co/collections/inclusionAI/llada-20

Models LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B) are open-sourced on Hugging Face. The paper describes the WSD strategy and attention masking in detail. Exact training compute (GPU hours) and specific dataset compositions are not detailed in the text.

📊 Experiments & Results

Evaluation Setup

General language understanding and instruction following

Benchmarks:

Not explicitly listed by name in summary text (General NLP tasks)

Metrics:

Performance (implied downstream task accuracy/score)
Inference Efficiency (Speed/Latency)

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

LLaDA2.0-flash (100B) demonstrates that diffusion models can be scaled to frontier sizes by leveraging AR initialization.
The WSD (Warmup-Stable-Decay) strategy effectively bridges the distribution gap between AR and Diffusion objectives, preventing catastrophic forgetting.
The document-level attention mask is critical for training stability when using packed sequences in bidirectional models.
Post-training with complementary masking and an auxiliary confidence loss lets the model decode efficiently in parallel while staying aligned.
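The summary does not define complementary masking. One plausible reading, sketched below as an assumption, is that each training sequence is used twice with a random mask and its exact complement, so that every token is supervised exactly once across the pair. The function name and the sampling scheme are illustrative.

```python
import random


def complementary_masks(seq_len, mask_ratio, rng):
    """Return a random mask and its complement over seq_len positions."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    mask = [rng.random() < mask_ratio for _ in range(seq_len)]
    complement = [not m for m in mask]  # masks exactly the remaining tokens
    return mask, complement


m, c = complementary_masks(8, 0.5, random.Random(0))
```

Under this reading, no token is wasted: positions left visible in the first view become prediction targets in the second.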

📚 Prerequisite Knowledge

Prerequisites

Auto-regressive (AR) Language Modeling
Discrete Diffusion / Masked Diffusion Language Models (MDLM)
Block Diffusion
Continual Pre-training (CPT)
KV-Cache

Key Terms

MDLM: Masked Diffusion Language Model—generates text by iteratively refining a sequence where tokens are randomly masked, allowing bidirectional context usage

BDLM: Block Diffusion Language Model—a hybrid approach where tokens are generated in blocks; diffusion is applied within blocks while blocks are generated auto-regressively

WSD: Warmup-Stable-Decay—the proposed three-phase training schedule to convert AR models to Diffusion models by manipulating block size

SFT: Supervised Fine-Tuning—training on instruction-response pairs to teach the model to follow user commands

DPO: Direct Preference Optimization—an alignment algorithm that optimizes the model to prefer higher-quality responses over lower-quality ones without a separate reward model

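KV-cache: Key-Value cache, which stores the attention keys and values of previous tokens so later generation steps can reuse them; fully bidirectional diffusion models usually cannot use it because any position may change at each denoising step, but Block Diffusion enables it for blocks that are already complete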

Top-k checkpoint merging: Averaging the parameters of the k best-performing checkpoints to improve generalization and stability

document-level attention mask: A masking technique that restricts attention to within individual documents when multiple short documents are packed into one training sequence