GoPT: Grouping Prompt Tuning—the proposed framework that uses semantic grouping to generate visual prompts for multi-modal adaptation.
CUP: Class-Aware Uni-Modal Prompter—a module that groups auxiliary modality tokens into semantic clusters to learn modality-specific prompts.
ACP: Alignment-Induced Cross-Modal Prompter—a module that aggregates the learned uni-modal prompts and injects them into the RGB backbone via cross-attention.
MAE: Masked Autoencoder—a self-supervised pre-training method for Vision Transformers used here as the foundation model initialization.
RGB-D: Red-Green-Blue plus Depth—a multi-modal format combining color images with distance information.
RGB-T: Red-Green-Blue plus Thermal—a multi-modal format combining color images with infrared thermal data.
RGB-SAR: Red-Green-Blue plus Synthetic Aperture Radar—a multi-modal format combining optical color images with radar imagery, used in remote sensing.
Gumbel-Softmax: A reparameterization trick that allows differentiable sampling from a categorical distribution, used here to assign tokens to semantic groups.
mIoU: Mean Intersection over Union—a standard metric for semantic segmentation accuracy.
Hard assignment: Assigning each token to exactly one group (a one-hot vector) rather than spreading it over the groups as a probability distribution (soft assignment); the paper finds hard assignment more effective.
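
To make the Gumbel-Softmax and hard-assignment entries concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `gumbel_softmax`, the temperature parameter `tau`, and the toy logits are assumptions chosen for illustration. Gumbel noise is added to each group's logit, a temperature-scaled softmax gives a soft assignment, and the hard variant keeps only the highest-scoring group as a one-hot vector.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, hard=True, rng=random):
    """Sample a group assignment for one token from unnormalized logits.

    Hypothetical helper for illustration; names and defaults are assumptions.
    Returns a probability vector (hard=False) or a one-hot vector (hard=True).
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u away from 0 and 1
        noisy.append((logit - math.log(-math.log(u))) / tau)

    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    if not hard:
        return soft

    # Hard assignment: one-hot vector at the most probable group.
    winner = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(soft))]


# Usage: assign two tokens to one of three groups each (toy logits).
rng = random.Random(0)
token_logits = [[2.0, 0.1, -1.0], [0.0, 3.0, 0.5]]
assignments = [gumbel_softmax(row, tau=1.0, hard=True, rng=rng) for row in token_logits]
```

In an automatic-differentiation framework, the hard variant is typically paired with a straight-through estimator, so the forward pass uses the one-hot vector while gradients flow through the soft probabilities. This sketch shows only the forward computation.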