Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

📝 Paper Summary

Masked Discrete Diffusion Models (MDMs) | Efficient Inference | Multimodal Generation and Understanding

Sparse-LaViDa accelerates Masked Discrete Diffusion Models by dynamically truncating redundant masked tokens during inference and using specialized register tokens to maintain generation quality.

Core Problem

Existing Masked Diffusion Models (MDMs) are inefficient because they must process the full sequence of tokens (including redundant masks) at every sampling step and cannot use KV-caching due to bidirectional attention requirements.

Why it matters:

MDMs offer advantages like parallel decoding and bidirectional context but are computationally expensive compared to autoregressive models.
Processing large numbers of redundant masked tokens (e.g., all 1024 tokens of an image when only a few are being unmasked) wastes significant compute.
Prior acceleration methods like Block Diffusion enforce left-to-right ordering, sacrificing the bidirectional context needed for image editing and inpainting.

Concrete Example: If an image is represented by 1024 tokens, a standard MDM processes all 1024 tokens at every diffusion step, even if only a small subset is being unmasked. Sparse-LaViDa processes only the prompt, previously generated tokens, and the specific subset of masked tokens currently being decoded.
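
The saving in this example can be made concrete with a little arithmetic. The sketch below is a toy count, not a measurement from the paper: the prompt length, the number of steps, the uniform unmasking schedule, and the helper name `tokens_per_step` are assumptions chosen for illustration. Only the 1024 image tokens and the 64 register tokens come from this summary, and KV-caching is deliberately ignored.

```python
def tokens_per_step(prompt_len, response_len, num_steps, num_registers):
    """Return (dense, sparse) lists of input lengths, one entry per step.

    Assumes a uniform schedule: response_len // num_steps tokens are
    unmasked at every step (an illustrative assumption).
    """
    per_step = response_len // num_steps
    dense, sparse = [], []
    for step in range(num_steps):
        decoded = step * per_step  # tokens unmasked in earlier steps
        # Dense MDM: prompt plus the whole response, masks included.
        dense.append(prompt_len + response_len)
        # Sparse input: prompt + decoded tokens + registers + current targets.
        sparse.append(prompt_len + decoded + num_registers + per_step)
    return dense, sparse


dense, sparse = tokens_per_step(prompt_len=128, response_len=1024,
                                num_steps=32, num_registers=64)
print("dense tokens, first step :", dense[0])
print("sparse tokens, first step:", sparse[0])
print("total tokens dense/sparse:", sum(dense), sum(sparse))
```

Under these assumptions the sparse input is much shorter in early steps and grows as more tokens are decoded, which is why the summary pairs truncation with KV-caching rather than relying on truncation alone.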

Key Novelty

Sparse Parameterization with Step-Causal Masking

Represents partially masked sequences sparsely: instead of materializing all masked tokens, the model receives only the tokens relevant to the current step (prompt + generated + current target masks).
Uses 'register tokens' as compact summaries of the truncated masked regions to prevent loss of model capacity.
Introduces a 'step-causal' attention mask that allows KV-caching (like AR models) while preserving the bidirectional context required for image tasks (unlike Block Diffusion).

Architecture

Comparison of inference paradigms: (Left) Standard MDM with full dense attention, (Middle) Block Diffusion with left-to-right causal masking, (Right) Sparse-LaViDa with sparse inputs and step-causal attention.

Evaluation Highlights

Achieves 1.95x speedup on text-to-image generation (latency reduced from 21.27s to 10.86s) while maintaining comparable generation quality to LaViDa-O.
Achieves 2.83x speedup on image editing tasks while improving accuracy (+0.08 on ImgEdit benchmark).
Maintains strong performance on visual math reasoning (MathVista) with a 2.80x speedup compared to the dense baseline.

Breakthrough Assessment

8/10

Significant efficiency gains (approx. 2-3x) for MDMs without sacrificing quality or bidirectional capabilities. Addresses the primary bottleneck preventing MDMs from competing with AR models in speed.

⚙️ Technical Details

Problem Definition

Setting: Unified multimodal generation and understanding using discrete diffusion

Inputs: Prompt tokens p and a partially masked sequence of response tokens X_t

Outputs: Predicted clean tokens X_0, or logits for the masked positions being decoded at the current step

Pipeline Flow

Input Processing (Prompt + current subset of masked tokens + Register Tokens)
Transformer Backbone (utilizing Step-Causal Attention & KV Cache)
Token Prediction (Logits for current subset)
Sampler (Decides next subset to unmask)
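
The four stages above can be read as one loop that repeats until no masks remain. The following is a minimal sketch under stated assumptions: `toy_model` is a hypothetical stand-in that returns random tokens instead of running a transformer, the sampler picks targets uniformly at random, and `MASK` is an illustrative sentinel. It shows only the control flow, not the paper's implementation.

```python
import random

MASK = -1  # hypothetical sentinel for a masked position


def toy_model(target_positions, vocab_size, rng):
    """Stand-in for the backbone: one (token, confidence) guess per target.

    A real backbone would return logits over the vocabulary.
    """
    return {pos: (rng.randrange(vocab_size), rng.random())
            for pos in target_positions}


def decode(response_len, tokens_per_step, vocab_size=16, seed=0):
    rng = random.Random(seed)
    response = [MASK] * response_len
    steps = 0
    while MASK in response:
        masked = [i for i, tok in enumerate(response) if tok == MASK]
        # Sampler: choose which masked positions to decode at this step
        # (uniformly at random here, purely for illustration).
        targets = rng.sample(masked, min(tokens_per_step, len(masked)))
        # Input processing + backbone: only the targets are passed on;
        # the remaining masked positions are never fed to the model.
        predictions = toy_model(targets, vocab_size, rng)
        # Token prediction: commit the predicted token at each target.
        for pos, (tok, _confidence) in predictions.items():
            response[pos] = tok
        steps += 1
    return response, steps


response, steps = decode(response_len=10, tokens_per_step=4)
print(steps, response)
```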

System Modules

Input Processor

Constructs sparse input sequence containing only prompt, previously decoded tokens, register tokens, and current target masked tokens.
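
As a rough illustration of this construction, the sketch below assembles a sparse input from a partially decoded response. The ids `MASK_ID` and `REGISTER_ID`, the function name `build_sparse_input`, and the choice to give register tokens position ids after the end of the full sequence are all assumptions made for this example. The summary does not specify how positions are assigned.

```python
MASK_ID = -1      # hypothetical id for a masked position
REGISTER_ID = -2  # hypothetical id shared by all register tokens


def build_sparse_input(prompt, response, targets, num_registers):
    """Return (token_ids, position_ids) for one sparse forward pass.

    prompt   : list of prompt token ids
    response : list of response token ids, MASK_ID where still masked
    targets  : indices into `response` to decode at this step
    """
    offset = len(prompt)
    token_ids, position_ids = list(prompt), list(range(offset))

    # Previously decoded response tokens keep their original positions.
    for i, tok in enumerate(response):
        if tok != MASK_ID:
            token_ids.append(tok)
            position_ids.append(offset + i)

    # Register tokens: placed after the full sequence (an assumption).
    end = offset + len(response)
    for r in range(num_registers):
        token_ids.append(REGISTER_ID)
        position_ids.append(end + r)

    # Only the masks being decoded now are materialized; the rest are dropped.
    for i in sorted(targets):
        if response[i] != MASK_ID:
            raise ValueError("targets must point at masked positions")
        token_ids.append(MASK_ID)
        position_ids.append(offset + i)

    return token_ids, position_ids


response = [7, MASK_ID, MASK_ID, 9, MASK_ID, MASK_ID, MASK_ID, MASK_ID]
ids, pos = build_sparse_input(prompt=[5, 6], response=response,
                              targets=[1, 4], num_registers=2)
print(ids)
print(pos)
```

Keeping the original position ids for the surviving tokens is what lets the model know where each token sits in the full sequence even though the non-target masks are absent.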

Model or implementation: Based on LaViDa-O tokenizer/embeddings

Backbone

Processes sparse inputs to predict original values of masked tokens.

Model or implementation: 10.4B Parameter Transformer (LaViDa-O weights)

Sampler

Determines which tokens to unmask next.

Model or implementation: Stratified random sampler (images) or Confidence-based (text)
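
The two selection rules can be sketched as follows. This is an illustrative reading, not the paper's code: `confidence_select` keeps the k masked positions with the highest confidence, and `stratified_select` is a one-dimensional simplification that splits the masked positions into k contiguous strata and draws one position from each, so picks are spread across the sequence. An image sampler would stratify over a 2D grid; both function names are hypothetical.

```python
import random


def confidence_select(confidences, k):
    """Pick the k masked positions with the highest confidence.

    confidences: dict mapping masked position -> confidence score.
    """
    ranked = sorted(confidences, key=lambda pos: confidences[pos], reverse=True)
    return sorted(ranked[:k])


def stratified_select(masked_positions, k, rng):
    """Draw one position from each of k contiguous, near-equal strata."""
    positions = sorted(masked_positions)
    k = min(k, len(positions))
    picks = []
    for s in range(k):
        lo = s * len(positions) // k
        hi = (s + 1) * len(positions) // k
        picks.append(rng.choice(positions[lo:hi]))
    return picks


print(confidence_select({0: 0.1, 3: 0.9, 5: 0.5}, k=2))
print(stratified_select(range(16), k=4, rng=random.Random(0)))
```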

Novel Architectural Elements

Sparse input formulation that physically excludes non-target masked tokens from the forward pass.
Integration of learnable register tokens that stand in for the capacity lost when masked tokens are truncated from the sequence.
Step-causal attention masking mechanism designed to make bidirectional MDMs compatible with KV-caching.
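
One way to picture the step-causal mask is as a block-causal pattern over decoding steps. The sketch below assumes each token carries a hypothetical `step_id` (the step at which it entered the sequence, with the prompt at step 0) and allows a query to attend to a key only if the key's step is not later than the query's. This is an interpretation of the description in this summary. The paper's exact handling of register and mask tokens may differ.

```python
def step_causal_mask(step_ids):
    """Return mask[q][k] == True where query q may attend to key k.

    Tokens from the same step see each other (bidirectional within a step);
    tokens from earlier steps never see later ones, so their keys and
    values can be cached and reused.
    """
    n = len(step_ids)
    return [[step_ids[k] <= step_ids[q] for k in range(n)] for q in range(n)]


# Two prompt tokens (step 0), two tokens added at step 1, two at step 2.
mask = step_causal_mask([0, 0, 1, 1, 2, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because rows for earlier steps never change when later tokens arrive, their attention outputs stay valid, which is the property that makes caching possible.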

Modeling

Base Model: LaViDa-O (10.4B parameters)

Training Method: Supervised Fine-Tuning (SFT) with sparse parameterization

Objective Functions:

Purpose: Minimize negative log-likelihood of clean tokens given masked context.

Formally: Standard MDM objective (Eq. 1 in paper), but computed only over the sparse subset of tokens.
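
Eq. 1 is not reproduced in this summary. For orientation, the block below writes out the commonly used masked-diffusion objective with a linear masking schedule and shows how restricting it to a target subset would look. The 1/t weighting, the symbol S_t for the target subset, and \tilde{X}_t for the sparse input are assumptions for illustration. The paper's exact form may differ.

```latex
% Assumed standard MDM objective (linear masking schedule, weight 1/t):
\mathcal{L}_{\text{dense}}(\theta)
  = -\,\mathbb{E}_{t,\,X_0,\,X_t}\left[ \frac{1}{t}
      \sum_{i \,:\, X_t^i = [\mathrm{M}]}
      \log p_\theta\!\left(X_0^i \mid p,\, X_t\right) \right]

% Sparse variant: the sum runs only over the target subset S_t of masked
% positions, and the model conditions on the sparse input \tilde{X}_t
% (prompt, decoded tokens, register tokens, target masks):
\mathcal{L}_{\text{sparse}}(\theta)
  = -\,\mathbb{E}_{t,\,X_0,\,X_t,\,S_t}\left[ \frac{1}{t}
      \sum_{i \in S_t}
      \log p_\theta\!\left(X_0^i \mid p,\, \tilde{X}_t\right) \right]
```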

Training Data:

Subsample of LaViDa-O SFT data
20M text-image pairs (LAION-2B, COYO-700M, etc.)
Image understanding data (MAmmoTH-VL)
Image editing data (GPT-Edit-1.5M)

Key Hyperparameters:

training_steps: 100k
gpus: 64 NVIDIA H100
register_tokens: 64

Compute: Training takes 100k steps on 64 H100s. Inference speedup is approx 2-3x over dense baseline.

Comparison to Prior Work

vs. Block Diffusion: Sparse-LaViDa preserves bidirectional context (crucial for editing/inpainting) via step-causal masking, whereas Block Diffusion is strictly semi-autoregressive.
vs. Fast-dLLM: Sparse-LaViDa is a training-based method with a mathematically consistent formulation, avoiding the unpredictable quality degradation of heuristic caching (Fast-dLLM is mentioned in the paper, though not as a primary comparison).
vs. LaViDa-O (Base): Sparse-LaViDa introduces sparse parameterization and register tokens to reduce compute per step significantly.

Limitations

Speedups are minimal for short QA tasks where output length is less than one block size (32 tokens).
Requires fine-tuning (SFT) to adapt the dense model to the sparse parameterization; not a plug-and-play inference optimization.
Register tokens are necessary; removing them degrades performance, indicating the model relies on them to maintain capacity.

Reproducibility

Code: https://github.com/TencentARC/LaViDa

Built on open weights of LaViDa-O. Training data is a filtered subset of open datasets (LAION, COYO, etc.).

📊 Experiments & Results

Evaluation Setup

Multimodal tasks including text-to-image generation, image editing, and visual understanding.

Benchmarks:

GenEval (Text-to-Image Generation)
DPG-bench (Text-to-Image Generation, Prompt Alignment)
MJHQ-30k (Text-to-Image Generation, Quality)
ImgEdit (Image Editing)
MathVista (Visual Math Reasoning)

Metrics:

Overall Score (GenEval)
FID (Fréchet Inception Distance)
HPS v2/v3 (Human Preference Score)
End-to-end Latency (seconds)
Accuracy (ImgEdit, MathVista)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GenEval	Overall Score	0.72	0.73	+0.01
GenEval	Latency (s)	21.27	10.86	-10.41
ImgEdit	Accuracy	0.58	0.66	+0.08
ImgEdit	Latency (s)	63.98	22.55	-41.43
MathVista	Latency (s)	39.98	14.07	-25.91
GenEval	Overall Score	0.71	0.73	+0.02
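
The latency rows can be turned into speedup ratios with simple division. The snippet below uses only the latencies listed in the table; the dictionary layout and the helper name `speedup` are illustrative. Ratios computed this way land close to the headline speedups quoted in this summary, though not always on the same second decimal.

```python
# (baseline latency, Sparse-LaViDa latency) in seconds, from the table above.
latencies = {
    "GenEval": (21.27, 10.86),
    "ImgEdit": (63.98, 22.55),
    "MathVista": (39.98, 14.07),
}


def speedup(baseline_s, ours_s):
    """Speedup is the baseline latency divided by the new latency."""
    return baseline_s / ours_s


for name, (baseline_s, ours_s) in latencies.items():
    saved = baseline_s - ours_s
    print(f"{name}: {speedup(baseline_s, ours_s):.2f}x, saves {saved:.2f}s")
```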

Experiment Figures

Radar chart comparing Sparse-LaViDa vs LaViDa-O on various metrics (Speedup, Quality, etc.)

Illustration of the Step-Causal Attention Mask used during training.

Main Takeaways

Sparse-LaViDa consistently accelerates inference by ~2-3x across diverse multimodal tasks (generation, editing, reasoning).
Generation quality is maintained or even slightly improved (e.g., lower FID, higher HPS) compared to the dense baseline, likely due to effective training with step-causal masks.
The combination of token truncation and KV-caching is synergistic; ablations show that using both yields significantly higher speedups than either alone.
Register tokens are critical for maintaining fine-grained visual details and prompt alignment, compensating for the information loss from truncating masked tokens.

📚 Prerequisite Knowledge

Prerequisites

Discrete Diffusion Models
Masked Generative Modeling (BERT/MAE style)
KV-Caching in Transformers
Attention mechanisms (Causal vs. Bidirectional)

Key Terms

MDM: Masked Discrete Diffusion Model, a generative model that iteratively unmasks tokens starting from a fully masked sequence.

KV-cache: Key-Value cache, which stores previous attention computations to avoid re-computing them; standard in autoregressive models but difficult in bidirectional ones.

Step-causal attention: A novel attention pattern where current tokens can attend to all previous steps' tokens (retrieved from cache) and current register tokens, but cached tokens cannot attend to new ones, enabling caching while allowing bidirectional interaction within the unmasked set.

Register tokens: Special learned tokens added to the sequence to represent the aggregate information of truncated (masked) tokens, compensating for the capacity loss when dropping masks.

Sparse parameterization: Representing the input sequence by only including prompt tokens, already decoded tokens, and the specific subset of masked tokens to be predicted this step, rather than the full dense sequence.

LaViDa-O: The dense baseline unified MDM model (10.4B parameters) upon which Sparse-LaViDa is built.