Matryoshka Representation Learning: A training technique that simultaneously learns nested representations at several sizes (e.g., embeddings of 64, 128, or 256 dims, or varying token counts), so that the model can use any of these sizes at inference
Saliency: Visual features that stand out or attract attention (dominant semantics)
Anti-saliency: Subtle or weak visual features that are often overshadowed by dominant features but are necessary for detailed understanding
AvgPool: Average Pooling—a downsampling operation that calculates the average value of a feature map patch, acting as a low-pass filter
MaxPool: Maximum Pooling—a downsampling operation that takes the maximum value, capturing the most prominent features
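Grad-CAM: Gradient-weighted Class Activation Mapping, a technique that uses the gradients of a target output with respect to a convolutional layer's feature maps to produce a heatmap of the image regions that most influence the model's prediction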
Q-Former: A module from BLIP-2 in which a fixed number of learnable query tokens cross-attend to the visual features, compressing them into a fixed-length set of output tokens
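
The AvgPool and MaxPool entries above can be contrasted on a toy example. The following is a minimal sketch in plain Python, assuming non-overlapping 2x2 windows; the helper names (pool2d, mean) and the 4x4 feature map values are hypothetical and chosen only for illustration.

```python
def pool2d(feature_map, size, reduce_fn):
    """Apply non-overlapping size x size pooling using the given reducer."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            # Flatten the current patch into a list of values.
            patch = [feature_map[r + i][c + j]
                     for i in range(size) for j in range(size)]
            out_row.append(reduce_fn(patch))
        pooled.append(out_row)
    return pooled


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical 4x4 feature map: a few strong activations among weak ones.
feature_map = [
    [9.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 3.0],
    [2.0, 2.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 4.0],
]

avg_pooled = pool2d(feature_map, 2, mean)  # each output blends all four values
max_pooled = pool2d(feature_map, 2, max)   # each output keeps only the strongest value

print("AvgPool:", avg_pooled)  # AvgPool: [[3.0, 1.5], [2.0, 1.0]]
print("MaxPool:", max_pooled)  # MaxPool: [[9.0, 3.0], [2.0, 4.0]]
```

In the top-left patch, MaxPool returns the single dominant activation (9.0) and discards the three weak ones, while AvgPool returns 3.0, a smoothed value to which every activation in the patch contributes.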