visual tokens: Small patches of an image converted into vector embeddings that the language model processes like words
centrifugal paradigm: The paper's proposed strategy of selecting tokens starting from a central point of interest and expanding outward to neighbors
spatial sparsity: How far apart the selected tokens lie in the 2D image grid; a sparse selection leaves large gaps between the chosen tokens
BSS: Buffering for Spatial Sparsity, a criterion that adjusts similarity scores according to spatial distance so that neighbors of already-selected tokens are prioritized
SWA: Similarity-Weighted Aggregation—a method to merge discarded tokens into selected ones by weighted averaging based on similarity
pivot tokens: The initial set of tokens selected to represent distinct subjects, serving as anchors for the expansion process
max-min distance: A selection strategy that picks each new point to maximize its minimum distance to the points already chosen, so the selected points end up as far from each other as possible and coverage is maximized
LLaVA: Large Language and Vision Assistant—a popular open-source VLM architecture
ViT: Vision Transformer—a neural network that processes images by splitting them into patches (tokens)
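
To make two of the terms above concrete, here is a minimal sketch of greedy max-min distance selection and of a similarity-weighted merge in the spirit of SWA. It is not the paper's implementation. The function names (`max_min_select`, `similarity_weighted_merge`, `cosine`), the use of Euclidean distance on grid coordinates, the use of cosine similarity, and the rule that each discarded token is merged only into its single most similar kept token are all assumptions made for illustration.

```python
import math


def max_min_select(points, k, start=0):
    """Greedy max-min (farthest-point) selection.

    Returns indices of up to k points (k >= 1 assumed); each new pick is the
    point whose minimum distance to the already-selected points is largest.
    """
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def similarity_weighted_merge(tokens, kept):
    """Merge every discarded token into its most similar kept token.

    Assumed weighting: the kept token has weight 1, and each discarded token
    contributes with weight equal to its (non-negative) cosine similarity.
    Returns one merged embedding per index in `kept`, in the same order.
    """
    sums = {j: list(tokens[j]) for j in kept}
    weights = {j: 1.0 for j in kept}
    kept_set = set(kept)
    for i, t in enumerate(tokens):
        if i in kept_set:
            continue
        sims = {j: cosine(t, tokens[j]) for j in kept}
        best = max(kept, key=lambda j: sims[j])
        w = max(sims[best], 0.0)  # ignore negatively correlated tokens
        sums[best] = [s + w * x for s, x in zip(sums[best], t)]
        weights[best] += w
    return [[s / weights[j] for s in sums[j]] for j in kept]


# Usage: pick 3 well-spread grid positions, then merge a discarded embedding.
grid = [(0, 0), (0, 1), (5, 5), (9, 9), (0, 9)]
pivots = max_min_select(grid, k=3)
merged = similarity_weighted_merge([[2, 0], [0, 2], [4, 0]], kept=[0, 1])
```

Starting from a fixed index keeps the selection deterministic. A real pipeline would operate on embedding tensors rather than Python lists, and might distribute each discarded token across several kept tokens instead of only one.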