MDM: Masked Discrete Diffusion Model, a generative model that iteratively unmasks tokens starting from a fully masked sequence.
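KV-cache: Key-Value cache, which stores the attention keys and values of previously processed tokens so they do not have to be recomputed. It is standard in autoregressive models but difficult in bidirectional ones, where a token's representation can change whenever any other token changes.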
Step-causal attention: A novel attention pattern in which tokens at the current step can attend to the tokens from all previous steps (retrieved from the cache) and to the current register tokens, while cached tokens cannot attend to newer ones. This makes caching possible while still allowing bidirectional interaction within the unmasked set.
Register tokens: Special learned tokens added to the sequence to represent the aggregate information of the truncated (masked) tokens, compensating for the capacity lost when those masks are dropped.
Sparse parameterization: Representing the input sequence using only the prompt tokens, the already decoded tokens, and the specific subset of masked tokens to be predicted at this step, rather than the full dense sequence.
LaViDa-O: The dense unified MDM baseline (10.4B parameters) on which Sparse-LaViDa is built.
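
The sparse parameterization and step-causal attention entries above can be illustrated with a small sketch. This is not the Sparse-LaViDa implementation: the function names, the `<mask>` and `<reg0>` placeholder strings, and the idea of tagging each token with the step at which it entered the sequence are all assumptions made for illustration. The mask rule used here (a query may attend to a key only if the key was introduced at the same or an earlier step) is one plausible reading of the definition.

```python
MASK = "<mask>"  # hypothetical placeholder for a masked token


def sparse_input(tokens, predict_now, num_registers):
    """Build a sparse input from a dense sequence (illustrative only).

    Keeps every non-mask token (prompt or already decoded) and only the
    masked positions listed in `predict_now`; all other masks are dropped.
    Register tokens are appended to stand in for the dropped masks.
    Each entry is (original_position, token); registers use position None.
    """
    kept = [
        (pos, tok)
        for pos, tok in enumerate(tokens)
        if tok != MASK or pos in predict_now
    ]
    registers = [(None, f"<reg{i}>") for i in range(num_registers)]
    return kept + registers


def step_causal_mask(steps):
    """Return allowed[q][k]: may query token q attend to key token k?

    `steps[i]` is the step at which token i entered the sequence.
    Assumed rule: attention is allowed when steps[k] <= steps[q], so
    cached (earlier-step) tokens never see newer tokens, while tokens
    from the same step see each other in both directions.
    """
    return [[k_step <= q_step for k_step in steps] for q_step in steps]


# Dense sequence of 6 tokens with three masks; predict only position 2 now.
dense = ["A", "B", MASK, "C", MASK, MASK]
sparse = sparse_input(dense, predict_now={2}, num_registers=1)

# Two prompt tokens (step 0), one cached token decoded at step 1,
# and two current-step tokens (a mask to predict and a register).
mask = step_causal_mask([0, 0, 1, 2, 2])
```

In this toy case the sparse input has 5 entries instead of 6, and the savings grow with the number of dropped masks. Because earlier-step rows of the mask never depend on later tokens, their keys and values stay valid across steps, which is what makes caching them possible.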