ARMs: Autoregressive Models—models that generate text one token at a time from left to right
MDMs: Masked Diffusion Models—models that generate text by iteratively unmasking tokens in a fixed-length sequence
ILMs: Insertion Language Models—the proposed method that generates text by inserting tokens at arbitrary positions
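The three entries above differ mainly in decoding order. A toy contrast, with a fixed target sequence standing in for actual model predictions (this illustrates only the token orderings, not the papers' training or sampling algorithms):

```python
import bisect
import random

TARGET = ["the", "cat", "sat", "on", "the", "mat"]

def arm_generate():
    # ARM: append tokens strictly left to right.
    seq = []
    for tok in TARGET:
        seq.append(tok)
    return seq

def mdm_generate(rng):
    # MDM: start from a fully masked fixed-length sequence and
    # reveal positions in an arbitrary order over several steps.
    seq = ["[MASK]"] * len(TARGET)
    order = list(range(len(TARGET)))
    rng.shuffle(order)
    for i in order:
        seq[i] = TARGET[i]
    return seq

def ilm_generate(rng):
    # ILM: grow the sequence from empty by inserting each new token
    # at an arbitrary slot, so length is not fixed in advance.
    positions, seq = [], []
    order = list(range(len(TARGET)))
    rng.shuffle(order)
    for i in order:
        slot = bisect.bisect_left(positions, i)  # keep relative order
        positions.insert(slot, i)
        seq.insert(slot, TARGET[i])
    return seq
```

All three routes arrive at the same sequence; what differs is which positions are committed first.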
DDiT: Diffusion Transformer architecture—a Transformer backbone with Adaptive Layer Norm used for diffusion models
AdaLN: Adaptive Layer Normalization—layer normalization whose scale and shift parameters are predicted from conditioning information (e.g., a timestep embedding) rather than learned as fixed weights
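A minimal sketch of the AdaLN idea, assuming a simple linear mapping from the conditioning vector to the per-feature scale and shift (the weight names `w_scale`, `w_shift`, etc. are hypothetical, not from the paper):

```python
import numpy as np

def ada_layer_norm(x, cond, w_scale, b_scale, w_shift, b_shift, eps=1e-5):
    """Normalize x over its last axis, then modulate with a scale and
    shift computed from the conditioning vector `cond` (e.g. a timestep
    embedding), instead of using fixed learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    scale = cond @ w_scale + b_scale   # conditioning-dependent gain, shape (dim,)
    shift = cond @ w_shift + b_shift   # conditioning-dependent bias, shape (dim,)
    return x_hat * (1.0 + scale) + shift
```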
RoPE: Rotary Positional Embeddings—a method that encodes position by rotating query and key vectors by position-dependent angles, so attention scores depend on relative offsets
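A self-contained sketch of the rotation RoPE applies, following the standard frequency schedule; this is illustrative, not the paper's implementation:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, dim).

    Each feature pair is rotated by an angle that grows with the token
    position; because rotations compose, dot products between rotated
    queries and keys depend only on their relative offset.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / dim)  # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied to each (x1_i, x2_i) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Since each pair undergoes a pure rotation, vector norms are preserved and position 0 is left unchanged.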
Zebra Puzzle: A constraint satisfaction logic puzzle requiring the assignment of attributes to entities based on clues
LM1B: One Billion Word Benchmark—a large text corpus used for language modeling evaluation
NLL: Negative Log-Likelihood—a metric measuring how well a model predicts the data (lower is better)
Entropy: A measure of the randomness or diversity of the generated text; higher entropy indicates more varied, less repetitive output
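The two metrics above can be written down directly; a minimal sketch in natural log units (nats), not tied to any particular tokenizer or model:

```python
import math

def nll(probs_of_targets):
    """Mean negative log-likelihood of the target tokens.

    `probs_of_targets` holds the probability the model assigned to each
    ground-truth token; a perfect model (all probabilities 1) gives 0.
    """
    return -sum(math.log(p) for p in probs_of_targets) / len(probs_of_targets)

def entropy(dist):
    """Shannon entropy of a token distribution.

    0 for a deterministic distribution; log(vocab_size) when uniform,
    i.e. maximally diverse.
    """
    return -sum(p * math.log(p) for p in dist if p > 0)
```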