ViT: Vision Transformer—a neural network architecture for image processing based on the Transformer mechanism originally designed for NLP.
OOD: Out-of-Distribution—data that differs significantly from the training distribution (e.g., sketches vs. photos).
AttnLRP: Attention-aware Layer-wise Relevance Propagation—an interpretability method specifically designed to trace relevance through Transformer attention layers faithfully.
Spurious Correlations: Patterns in the data (such as a grass background in images of cows) that are predictive in the training set but are not an intrinsic property of the class.
GroundedSAM: A model combining Grounding DINO (text-to-box) and SAM (Segment Anything Model) to generate segmentation masks from text prompts.
VLM: Vision-Language Model—a model capable of processing and relating both image and text inputs.
IoU: Intersection over Union—a metric for evaluating segmentation overlap.
LRP: Layer-wise Relevance Propagation—a technique for determining which pixels contributed most to a neural network's decision.
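
To make the IoU entry above concrete: for a predicted mask A and a reference mask B, IoU = |A ∩ B| / |A ∪ B|, so it ranges from 0 (no overlap) to 1 (identical masks). The following is a minimal sketch in plain Python, assuming binary masks stored as equally sized nested lists of 0/1 values. The `Mask` alias, the `iou` function name, and the choice to return 1.0 when both masks are empty are illustrative assumptions, not taken from the source.

```python
from typing import Sequence

# Hypothetical mask type for illustration: rows of 0/1 pixel values.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union of two equally sized binary masks."""
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")

    intersection = 0  # pixels set in both masks
    union = 0         # pixels set in at least one mask
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1

    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

# 2 pixels are set in both masks, 4 pixels are set in at least one.
score = iou(pred, target)
print(score)
```

In practice the same computation is usually vectorized over array masks; the loop form is used here only to keep each term of the ratio visible.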