MLLM: Multimodal Large Language Model—an AI system capable of processing and generating both text and images
ViT: Vision Transformer—a model architecture that processes images as sequences of patches (tokens) using self-attention
Instruction Following: The ability of a model to precisely adhere to constraints in a prompt (e.g., 'respond in JSON', 'limit to 10 words')
Token Compression: Reducing the number of tokens representing an image to decrease computational cost and redundancy
Spatial Down-sampling: A naive method of reducing image tokens by pooling or skipping spatial patches regardless of their content, often leading to information loss
Attention Inhibition: Selectively suppressing (masking) attention weights between specific token pairs to prevent the model from focusing on irrelevant information
K-Means: A clustering algorithm, used here to group semantically similar (and therefore redundant) visual tokens before merging them
Causal Mask: A mask used in autoregressive language models so that each position attends only to itself and earlier tokens; modified here to also inhibit attention to specific image tokens
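
As a rough illustration of how the Token Compression, K-Means, Attention Inhibition, and Causal Mask entries might fit together, the pure-Python sketch below clusters toy token vectors, merges each cluster into its mean, and builds a causal mask with selected key positions blocked. The function names, the first-k centroid initialisation, merging by averaging, and the choice to let an inhibited token still attend to itself are all assumptions made for illustration; they are not taken from the source.

```python
import math

def kmeans_merge(tokens, k, iters=10):
    """Cluster token vectors with plain K-Means and return the k centroids.

    Each centroid stands in for the merged token of its cluster.
    Assumption: centroids are initialised from the first k tokens so the
    result is deterministic.
    """
    centroids = [list(t) for t in tokens[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for t in tokens:
            # Assign the token to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(t, centroids[c])),
            )
            clusters[nearest].append(t)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def inhibited_causal_mask(seq_len, inhibited_keys):
    """Return a seq_len x seq_len boolean mask where True means 'may attend'.

    Starts from the standard causal (lower-triangular) mask, then blocks every
    query from attending to the inhibited key positions. Assumption: a token
    may still attend to itself, so no row is ever fully masked.
    """
    mask = [[key <= query for key in range(seq_len)] for query in range(seq_len)]
    for query in range(seq_len):
        for key in inhibited_keys:
            if key != query:
                mask[query][key] = False
    return mask

def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, giving zero weight to blocked keys."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage: four 2-D "image tokens" compressed to two merged tokens.
tokens = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
merged = kmeans_merge(tokens, k=2)

# Four-position sequence in which position 1 is an inhibited image token.
mask = inhibited_causal_mask(4, inhibited_keys={1})
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], mask[3])
```

In this toy setup the last query spreads its attention evenly over positions 0, 2, and 3 and gives position 1 zero weight. In a real model the mask would typically be applied by adding a large negative value to the blocked logits before the softmax, which has the same effect.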