GRAFT: Ground Remote Alignment for Training—the proposed method of aligning satellite images to CLIP space via co-located ground images.
CLIP: Contrastive Language-Image Pre-training—a foundation model trained on internet image-text pairs that learns a shared embedding space for images and text.
ViT: Vision Transformer—a neural network architecture that processes images as sequences of patches using self-attention mechanisms.
NAIP: National Agriculture Imagery Program—high-resolution (1m/pixel) aerial imagery covering the continental United States.
Sentinel-2: A satellite mission providing lower-resolution (10m/pixel) global optical imagery.
SAM: Segment Anything Model—a foundational image segmentation model that can generate masks from point prompts.
VQA: Visual Question Answering—the task of answering natural language questions about the visual content of an image.
geotag: Metadata embedded in an image file indicating the precise latitude and longitude where the photo was taken.
ViperGPT: A framework that uses large language models to generate programs that call vision APIs to answer visual queries.
zero-shot: The ability of a model to perform a task (like classifying a 'stadium') without having seen explicit labeled examples of that specific class during training.
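
Illustration for the CLIP and zero-shot entries: the sketch below shows how a shared embedding space permits zero-shot classification, by comparing one image embedding against text embeddings of candidate class names and choosing the most similar. The three-dimensional vectors are made-up stand-ins rather than real CLIP outputs, and the function names (`normalize`, `cosine`, `zero_shot_classify`) are hypothetical helpers introduced only for this example.

```python
import math

def normalize(vec):
    # Scale a vector to unit length so a dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def zero_shot_classify(image_emb, text_embs):
    # Score the image against every candidate label and return the best match.
    # No labeled training examples of any class are involved.
    scores = {label: cosine(image_emb, emb) for label, emb in text_embs.items()}
    return max(scores, key=scores.get), scores

# Toy embeddings (assumed values, for illustration only).
text_embs = {
    "stadium": [1.0, 0.0, 0.0],
    "forest": [0.0, 1.0, 0.0],
    "parking lot": [0.0, 0.0, 1.0],
}
image_emb = [0.9, 0.1, 0.2]

predicted, scores = zero_shot_classify(image_emb, text_embs)
print(predicted)
```

In a real pipeline the text embeddings would come from a text encoder applied to prompts naming each class, and the image embedding from an image encoder trained to share that space; the comparison step is the same.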