LLaDA2.0: Scaling Up Diffusion Language Models to 100B

📝 Paper Summary

Discrete Diffusion Language Models (dLLM)
Non-Autoregressive Generation
Efficient Large Language Models

LLaDA2.0 successfully converts large-scale pre-trained autoregressive models into efficient 16B and 100B parameter diffusion language models using a progressive block-size training strategy.

Core Problem

Training large-scale diffusion language models (dLLMs) from scratch is prohibitively expensive, but directly converting pre-trained autoregressive (AR) models fails due to the fundamental distribution gap between sequential and bidirectional generation.

Why it matters:

Autoregressive models suffer from sequential inference bottlenecks, preventing parallel generation and increasing latency at scale.
Existing diffusion models are limited to small scales (≤8B), failing to match the frontier capabilities of 100B+ AR models.
Direct conversion without careful handling leads to catastrophic forgetting of the AR model's linguistic knowledge.

Concrete Example: When directly switching a standard AR model to a diffusion objective, the model often collapses because it cannot handle bidirectional context immediately. Additionally, training on packed sequences causes 'cross-document interference,' where the model confuses contexts from unrelated documents concatenated together.

Key Novelty

Warmup-Stable-Decay (WSD) Continual Pre-training

Transitions an AR model to a diffusion model by progressively increasing block size: starts with small blocks (Warmup), moves to full-sequence global diffusion (Stable), and reverts to compact blocks for efficient inference (Decay).
Uses a document-level attention mask to prevent spurious dependencies between unrelated documents in packed training sequences.
Integrates a confidence-aware loss during post-training to encourage the model to be 'sharper,' enabling more aggressive parallel decoding.
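
The link between a "sharper" model and more aggressive parallel decoding can be illustrated with a minimal sketch of threshold-based decoding inside one block. This is an assumption for illustration, not the paper's decoding procedure: the `MASK` id, the `threshold` value, the fallback rule, and the `toy_predict` stand-in for the model are all hypothetical.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical mask token id


def decode_block(
    block: List[int],
    predict: Callable[[List[int]], List[Tuple[int, float]]],
    threshold: float = 0.9,
) -> Tuple[List[int], int]:
    """Fill every MASK in `block`; return the filled block and the step count."""
    block = list(block)
    steps = 0
    while MASK in block:
        preds = predict(block)  # one (token, confidence) pair per position
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        # Commit, in parallel, every masked position that clears the threshold.
        accept = [i for i in masked if preds[i][1] >= threshold]
        if not accept:
            # Commit the single most confident position so decoding terminates.
            accept = [max(masked, key=lambda i: preds[i][1])]
        for i in accept:
            block[i] = preds[i][0]
        steps += 1
    return block, steps


def toy_predict(block: List[int]) -> List[Tuple[int, float]]:
    # Stand-in for the model (hypothetical): predicts token == position index
    # and is confident only at even positions.
    return [(i, 0.95 if i % 2 == 0 else 0.5) for i in range(len(block))]


filled, n_steps = decode_block([MASK] * 4, toy_predict, threshold=0.9)
```

With this toy predictor, the two confident positions are committed in the first step and the remaining two need one step each. If every position cleared the threshold, the whole block would be filled in a single step, which is the behavior a confidence-oriented loss is meant to make more common.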

Architecture

Figure: The holistic training pipeline of LLaDA2.0, illustrating the transition from AR to MDLM and then to BDLM.

Evaluation Highlights

LLaDA2.0-flash (100B) achieves superior performance and efficiency compared to scratch-trained baselines, validating the AR-initialization strategy at scale.
Parallel decoding yields an inference speedup, with throughput surpassing that of equivalently sized AR models for large batches.
Post-training with SFT and DPO successfully aligns the diffusion model, yielding the LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B) instruction-tuned variants.

Breakthrough Assessment

9/10

First successful scaling of diffusion language models to the 100B parameter regime, bridging the gap with frontier AR models while enabling parallel decoding.

⚙️ Technical Details

Problem Definition

Setting: Discrete Masked Diffusion Language Modeling (MDLM) and Block Diffusion Language Modeling (BDLM)

Inputs: Masked sequence x_t (subset of tokens replaced with [MASK])

Outputs: Reconstructed original tokens x_0 for the masked positions
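
To make the input/output pair concrete, here is a minimal sketch of producing a masked sequence x_t from x_0. It assumes each token is masked independently with probability t, which is a common choice for masked diffusion but is not confirmed by this summary; `MASK_ID` and `mask_tokens` are hypothetical names.

```python
import random
from typing import List, Tuple

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def mask_tokens(
    x0: List[int], t: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Return (x_t, masked_indices), masking each token with probability t."""
    xt: List[int] = []
    masked_idx: List[int] = []
    for i, tok in enumerate(x0):
        if rng.random() < t:
            xt.append(MASK_ID)
            masked_idx.append(i)
        else:
            xt.append(tok)
    return xt, masked_idx


x0 = [11, 42, 7, 99, 3]
xt, masked_idx = mask_tokens(x0, t=0.5, rng=random.Random(0))
# The model's targets are the original tokens at the masked positions only.
targets = [x0[i] for i in masked_idx]
```

At t = 0 nothing is masked and at t = 1 every token is masked, so t interpolates between the clean sequence and a fully masked one.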

Pipeline Flow

Input Sequence (masked according to schedule)
Document-Level Attention Masking
Transformer Backbone (MoE)
Prediction Head (reconstructs masked tokens)

System Modules

Input Processor

Masks input tokens based on the diffusion timestep t and block configuration

Model or implementation: Deterministic masking logic

Attention Mechanism

Computes self-attention with specific masks to prevent cross-document contamination

Model or implementation: Modified Self-Attention with Document-Level Mask

Denoising Head

Predicts original identities of masked tokens

Model or implementation: Linear projection

Novel Architectural Elements

Document-level attention mask for diffusion: A specialized mask M that enforces locality within documents while allowing bidirectional context for diffusion and causal context for history blocks.
Warmup-Stable-Decay (WSD) training schedule: A structural curriculum that dynamically alters the model's receptive field (block size) during training.
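
A simplified sketch of such a mask follows. It assumes documents are contiguous in the packed sequence and that blocks are counted from the start of each document; the actual training mask in the paper may be more involved, and `build_attention_mask` is a hypothetical helper.

```python
from typing import List


def build_attention_mask(doc_ids: List[int], block_size: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    n = len(doc_ids)
    # Position of each token inside its own document (documents are contiguous).
    pos_in_doc: List[int] = []
    prev, count = None, 0
    for d in doc_ids:
        count = count + 1 if d == prev else 0
        pos_in_doc.append(count)
        prev = d
    block = [p // block_size for p in pos_in_doc]
    return [
        [
            # Same document only; bidirectional inside a block,
            # causal towards earlier (history) blocks.
            doc_ids[i] == doc_ids[j] and block[j] <= block[i]
            for j in range(n)
        ]
        for i in range(n)
    ]


# Two packed documents: four tokens of document 0, then two tokens of document 1.
doc_ids = [0, 0, 0, 0, 1, 1]
mask = build_attention_mask(doc_ids, block_size=2)
```

Under these assumptions, a block size of 1 reduces to a per-document causal mask, and a block size at least as long as the document reduces to full bidirectional attention within that document, which matches the two ends of the block-size curriculum.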

Modeling

Base Model: Ling-mini-2.0 (16B) and Ling-flash-2.0 (100B) [AR models]

Training Method: Continual Pre-training (CPT) followed by SFT and DPO

Objective Functions:

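Purpose: Reconstruct masked tokens within blocks.

Formally: $\mathcal{L}_{\text{BDLM}} = \mathbb{E}\left[\sum_{i \in \text{masked}} w \cdot \mathrm{CE}\big(x_0^i,\, p_\theta(\cdot \mid x_t)\big)\right]$, where the sum runs over masked indices, $w$ is the loss weight, and the cross-entropy compares each original token $x_0^i$ with the model's prediction given the masked input $x_t$.

Purpose: Align the model with human preferences using preference pairs.

Formally: DPO objective reformulated over the reconstruction loss of the diffusion model.

Purpose: Encourage high confidence for parallel decoding.

Formally: Auxiliary confidence prediction loss (the exact form is not spelled out in this summary).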
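
As a worked instance of the reconstruction objective, the sketch below computes a weighted cross-entropy over masked positions for a single sequence. The constant `weight` is a placeholder: the summary only says a weight is applied, so its actual form (for example a timestep-dependent factor) is an assumption left open here.

```python
import math
from typing import List, Sequence


def masked_reconstruction_loss(
    x0: Sequence[int],
    masked_idx: Sequence[int],
    probs: Sequence[Sequence[float]],
    weight: float = 1.0,
) -> float:
    """Weighted sum of cross-entropy terms over masked positions.

    probs[i] is the model's predicted distribution over the vocabulary at
    position i, computed from the masked input x_t.
    """
    return weight * sum(-math.log(probs[i][x0[i]]) for i in masked_idx)


x0 = [2, 0, 1]          # original tokens (vocabulary of size 3)
masked_idx = [0, 2]     # positions that were replaced by [MASK]
probs: List[List[float]] = [
    [0.1, 0.1, 0.8],    # position 0: true token 2 gets probability 0.8
    [1 / 3, 1 / 3, 1 / 3],  # position 1 is unmasked, so it is ignored
    [0.25, 0.5, 0.25],  # position 2: true token 1 gets probability 0.5
]
loss = masked_reconstruction_loss(x0, masked_idx, probs)
```

Only positions 0 and 2 contribute, so the loss equals -(ln 0.8 + ln 0.5); unmasked positions carry no gradient signal in this formulation.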

Adaptation: Full parameter update with Top-k checkpoint merging
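
A minimal sketch of top-k checkpoint merging, assuming checkpoints are ranked by a validation score and merged by uniform averaging. The ranking metric, the dictionary layout, and the name `merge_top_k` are hypothetical.

```python
from typing import Dict, List, Tuple

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened weights


def merge_top_k(checkpoints: List[Tuple[float, Checkpoint]], k: int) -> Checkpoint:
    """Average the parameters of the k checkpoints with the highest score."""
    best = sorted(checkpoints, key=lambda c: c[0], reverse=True)[:k]
    merged: Checkpoint = {}
    for name in best[0][1]:
        # Group the same weight across checkpoints, then take the mean.
        columns = zip(*(ckpt[name] for _, ckpt in best))
        merged[name] = [sum(col) / len(best) for col in columns]
    return merged


checkpoints = [
    (0.70, {"w": [1.0, 2.0]}),
    (0.90, {"w": [3.0, 4.0]}),
    (0.80, {"w": [5.0, 6.0]}),
]
merged = merge_top_k(checkpoints, k=2)  # averages the 0.90 and 0.80 checkpoints
```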

Training Data:

Packed heterogeneous documents for CPT
Instruction tuning datasets for SFT
Preference pairs for DPO
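
For the preference pairs, one way to read "DPO reformulated over the reconstruction loss" is to let the negative reconstruction loss stand in for the sequence log-likelihood in the standard DPO formula. The sketch below follows that reading; it is an assumption rather than the paper's exact objective, and `beta` and the function name are hypothetical.

```python
import math


def diffusion_dpo_loss(
    policy_loss_chosen: float,
    policy_loss_rejected: float,
    ref_loss_chosen: float,
    ref_loss_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, using reconstruction losses."""
    # Negative reconstruction loss plays the role of log-likelihood, so the
    # log-ratio (policy vs. reference) becomes a difference of losses.
    chosen_margin = -(policy_loss_chosen - ref_loss_chosen)
    rejected_margin = -(policy_loss_rejected - ref_loss_rejected)
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# The policy reconstructs the chosen response better than the reference does,
# and the rejected response worse, so the loss drops below log(2).
example = diffusion_dpo_loss(0.5, 2.0, 1.0, 1.0)
```

When the policy and the reference assign identical losses, the value is log 2, the same starting point as standard DPO.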

Key Hyperparameters:

block_size_schedule: 1 -> 4 -> 32 -> 64 -> 4096 (Warmup) -> 4096 (Stable) -> 32 (Decay)
final_block_size: 32 (for inference efficiency)
max_sequence_length: 4096
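
The schedule above can be read as a lookup from training step to block size. The sketch below uses the block sizes listed in the summary, but the equal stage length (`steps_per_stage`) is a hypothetical simplification, since the summary does not report how long each stage lasts.

```python
from typing import List, Tuple

# Block sizes per phase, taken from the listed schedule.
SCHEDULE: List[Tuple[str, List[int]]] = [
    ("warmup", [1, 4, 32, 64, 4096]),
    ("stable", [4096]),
    ("decay", [32]),
]


def block_size_at(step: int, steps_per_stage: int) -> Tuple[str, int]:
    """Return (phase, block_size) for a training step.

    Assumes every stage lasts `steps_per_stage` steps (hypothetical) and that
    training stays in the final stage once the schedule is exhausted.
    """
    stages = [(phase, size) for phase, sizes in SCHEDULE for size in sizes]
    idx = min(step // steps_per_stage, len(stages) - 1)
    return stages[idx]


first = block_size_at(0, steps_per_stage=100)      # start of warmup
stable = block_size_at(500, steps_per_stage=100)   # full-sequence diffusion
last = block_size_at(10_000, steps_per_stage=100)  # decayed block size
```

A block size of 1 corresponds to the original autoregressive behavior and 4096 equals the maximum sequence length, so the warmup phase moves from token-by-token prediction to full-sequence diffusion before the decay phase returns to the inference block size of 32.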

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaDA/LLaDA-MoE: LLaDA2.0 scales to 100B by initializing from AR models rather than training from scratch.
vs. DiffusionLLaMA/RND1: LLaDA2.0 uses the Warmup-Stable-Decay (WSD) strategy to adapt block sizes progressively, rather than relying only on mask annealing or direct conversion.
vs. SDAR [cited in paper]: LLaDA2.0 scales beyond 30B parameters and adds a decay phase to optimize for block-based inference efficiency.

Limitations

Inference speed still depends on the number of diffusion steps, though mitigated by block diffusion.
Requires a high-quality pre-trained AR checkpoint as initialization.
Computational cost of CPT is significant, although less than training from scratch.

Reproducibility

Code: https://hf.co/collections/inclusionAI/llada-20

Code and models are publicly released in the Hugging Face collection linked above. The paper describes the WSD strategy and attention masks in detail. Training compute (GPU hours) is not explicitly reported.

📊 Experiments & Results

Evaluation Setup

General language understanding and generation benchmarks.

Benchmarks:

General NLP tasks (various; the specific benchmarks, such as MMLU or GSM8K, are implied but not tabulated in the source text)

Metrics:

Performance (implied downstream task scores)
Inference Efficiency (Speed/Latency)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Model scale	Parameters	8B (largest prior dLLM)	100B	+92B

Main Takeaways

WSD strategy effectively bridges the gap between AR and Diffusion training, preventing collapse.
Document-level attention masks are critical for stable training on packed sequences.
Post-training (SFT+DPO) is viable for diffusion models and produces instruction-following capabilities.
Top-k checkpoint merging improves final model robustness.

📚 Prerequisite Knowledge

Prerequisites

Autoregressive (AR) language modeling
Discrete Diffusion / Masked Language Modeling (MLM)
Mixture-of-Experts (MoE) architecture

Key Terms

dLLM: Discrete Diffusion Large Language Model—a generative model that creates text by iteratively denoising a random sequence rather than predicting the next token.

MDLM: Masked Diffusion Language Model—a specific type of dLLM that learns to reconstruct randomly masked tokens.

BDLM: Block Diffusion Language Model—generates text in contiguous blocks; within a block, tokens are generated via diffusion, while blocks may be generated sequentially.

AR: Auto-regressive—models that generate text one token at a time, strictly left-to-right (e.g., GPT-4).

SFT: Supervised Fine-Tuning—training on instruction-response pairs to teach the model to follow commands.

DPO: Direct Preference Optimization—an alignment method that optimizes the model to prefer human-chosen responses over rejected ones without a separate reward model.

WSD: Warmup-Stable-Decay—the proposed three-phase training schedule: gradually increasing block size, training on full sequences, then decreasing block size.

KV-cache: Key-Value cache—storing attention computations for past tokens to speed up sequential generation.

Top-k checkpoint merging: Averaging the weights of the k best-performing model checkpoints to improve stability and generalization.