ViPT: Visual Prompt multi-modal Tracking—the proposed framework
MCP: Modality-Complementary Prompter—a lightweight block inserted into the frozen backbone to generate prompts from auxiliary modalities
RGB-D: Red-Green-Blue + Depth modality
RGB-T: Red-Green-Blue + Thermal modality
RGB-E: Red-Green-Blue + Event modality
Foundation Model: A large-scale pre-trained model (here, an RGB tracker based on ViT) used as the starting point
Prompt-tuning: Freezing the main model and optimizing only a small set of added parameters (prompts) to adapt to a new task
Full fine-tuning: Updating all parameters of a pre-trained model on a downstream dataset
EAO: Expected Average Overlap, the primary metric of the VOT challenges, which combines accuracy and robustness into a single score
OSTrack: The specific RGB-based foundation tracker used in this paper (a one-stream Transformer tracking framework)
Spatial Fovea: An operation within the MCP block that applies a spatial attention mask to focus on salient regions
t-SNE: t-Distributed Stochastic Neighbor Embedding—a technique for dimensionality reduction used to visualize feature clusters
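
The difference between the Prompt-tuning and Full fine-tuning entries above can be made concrete with a small sketch. It is a toy illustration only: the parameter names (`backbone.block*`, `prompter.block*`), the 12-block layout, and the sizes are hypothetical and are not taken from ViPT or OSTrack. It shows just the bookkeeping, namely which parameters are marked trainable under each strategy.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A named group of weights; sizes are made up for illustration."""
    name: str
    size: int              # number of scalar weights in this group
    trainable: bool = True


def build_model():
    # Hypothetical layout: a large pre-trained backbone plus small prompt blocks.
    backbone = [Param(f"backbone.block{i}", 1_000_000) for i in range(12)]
    prompts = [Param(f"prompter.block{i}", 10_000) for i in range(12)]
    return backbone + prompts


def full_fine_tuning(params):
    # Full fine-tuning: every parameter is updated on the downstream data.
    for p in params:
        p.trainable = True
    return params


def prompt_tuning(params):
    # Prompt-tuning: freeze the backbone, optimize only the added prompt parameters.
    for p in params:
        p.trainable = p.name.startswith("prompter.")
    return params


def trainable_fraction(params):
    # Share of all weights that an optimizer would actually update.
    total = sum(p.size for p in params)
    trainable = sum(p.size for p in params if p.trainable)
    return trainable / total


if __name__ == "__main__":
    print("full fine-tuning:", trainable_fraction(full_fine_tuning(build_model())))
    print("prompt-tuning:   ", trainable_fraction(prompt_tuning(build_model())))
```

With these made-up sizes, prompt-tuning updates only a small fraction of the weights, which is the property the glossary entry describes; the real ratio depends on the actual backbone and prompter sizes.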