UMM: Unified Multimodal Model—a single model capable of generating both text and images in one autoregressive stream
Event Bottleneck: The finding that effective context length is limited by the number of distinct visual events (images) rather than the raw number of tokens
Active Pollution: A failure mode where historical visual tokens spuriously match current queries and 'hijack' the attention budget, actively corrupting the output
Passive Dilution: A failure mode common in text, where relevant information is simply lost or outweighed by noise, leading to vague outputs
KV Cache: Key-Value Cache—memory storing pre-computed attention representations of past tokens to speed up generation
Softmax: The function used in attention to normalize raw scores into a probability distribution; because it exponentiates each score, a spuriously high outlier can receive a disproportionate share of the attention weight
VAE: Variational Autoencoder—used here to compress images into latent tokens for generation
ViT: Vision Transformer—used here to extract semantic features from images
Tail-risk hijacking: When rare, high-similarity outlier tokens in the history capture a disproportionate amount of attention due to Softmax amplification
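
The Softmax and Tail-risk hijacking entries can be illustrated with a small numeric sketch. The logits below are hypothetical values chosen for illustration, not measurements from any model: 99 ordinary history tokens share one score and a single outlier token scores higher. The sketch compares the outlier's share of attention under Softmax with its share under plain proportional normalization.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)                       # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention logits (illustrative assumption):
# 99 ordinary history tokens plus one spurious high-similarity outlier.
scores = [1.0] * 99 + [9.0]

weights = softmax(scores)
outlier_share = weights[-1]               # attention captured by the outlier

# Baseline for comparison: share if scores were normalized linearly.
linear_share = scores[-1] / sum(scores)

print(f"softmax share of outlier: {outlier_share:.3f}")
print(f"linear share of outlier:  {linear_share:.3f}")
```

Under these assumed values, a single token out of 100 takes most of the attention budget after Softmax, while under linear normalization it would hold under a tenth. The gap comes from the exponentiation step, which is the amplification the two entries describe.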