Visual Prompt Multi-Modal Tracking

📝 Paper Summary

Visual Object Tracking · Multi-Modal Learning · Prompt Learning

ViPT adapts a frozen, pre-trained RGB tracking foundation model to multi-modal tasks by learning lightweight modality-complementary prompts instead of full fine-tuning.

Core Problem

Multi-modal tracking (RGB + Depth/Thermal/Event) has far less training data than RGB tracking, so fully fine-tuning a foundation model on it is prone to overfitting and is parameter-inefficient.

Why it matters:

Full fine-tuning is storage-heavy, requiring a separate copy of the large foundation model for each downstream task
Scarcity of multi-modal data (e.g., DepthTrack has ~0.2M frames vs. TrackingNet's 14M) limits the generalization of fully tuned models
Existing methods often design complex extra branches for auxiliary modalities, increasing architectural complexity

Concrete Example: In RGB-Thermal tracking, a standard approach builds a two-stream network and retrains all parameters. Due to limited thermal data, the model might overfit the small dataset and lose the robust feature extraction capabilities learned from massive RGB datasets.

Key Novelty

Visual Prompt Multi-modal Tracking (ViPT)

Freeze the entire pre-trained RGB foundation model to preserve general visual knowledge
Introduce sparse 'Modality-Complementary Prompter' (MCP) blocks that inject auxiliary modality information (Depth/Thermal/Event) into the frozen backbone
Learn only the prompt parameters (<1% of total), allowing the model to adapt to new modalities while maintaining the robustness of the foundation model

Architecture

Overview of ViPT architecture and detailed Modality-Complementary Prompter (MCP) design.

Evaluation Highlights

Achieves a state-of-the-art 59.4% F-score on DepthTrack (RGB-D), surpassing the foundation model by 6.5 percentage points
Outperforms the runner-up by 10.5 percentage points in Success Rate on the LasHeR benchmark (RGB-T) while using <1% trainable parameters
Beats full fine-tuning (FFT) paradigms across all tasks despite having two orders of magnitude fewer trainable parameters (0.84M vs 178.6M)

Breakthrough Assessment

8/10

Successfully introduces prompt learning to multi-modal tracking, demonstrating that fine-tuning <1% of parameters can outperform full fine-tuning. A highly efficient and effective adaptation strategy.

⚙️ Technical Details

Problem Definition

Setting: Single Object Tracking given multi-modal inputs (RGB + Auxiliary)

Inputs: RGB frames X_RGB and synchronized auxiliary frames X_A (Depth, Thermal, or Event)

Outputs: Bounding box B of the target in search frames

Pipeline Flow

Input Embedding (RGB & Auxiliary)
Frozen Transformer Encoder (RGB features)
Modality-Complementary Prompter (Fusion & Prompt Generation)
Box Prediction Head

System Modules

Patch Embed

Project images to token sequences

Model or implementation: Linear projection / Convolution
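As a rough illustration of the tokenization step, the sketch below splits a 2D image into non-overlapping patches and flattens each one into a token. The function name `patchify`, the nested-list image representation, and the tiny 4x4 input are assumptions made for illustration; the learned linear projection that normally follows is omitted.

```python
def patchify(image, patch):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Illustrative only: a real patch embedding also applies a learned
    linear projection (or an equivalent strided convolution) to each patch.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch in row-major order into a single token.
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# A toy 4x4 "image" with pixel values 0..15 and a patch size of 2.
toy_image = [[4 * r + c for c in range(4)] for r in range(4)]
toy_tokens = patchify(toy_image, 2)
print(len(toy_tokens), "tokens of length", len(toy_tokens[0]))
```

The same routine would be applied to the RGB frame and to the auxiliary frame, giving two token sequences of equal length.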

Foundation Encoder

Extract visual features using pre-trained knowledge

Model or implementation: ViT-Base (from OSTrack), Frozen

Modality-Complementary Prompter (MCP)

Learn effective visual prompts by fusing intermediate foundation features with auxiliary features

Model or implementation: Lightweight Conv blocks + Attention

Box Head

Predict target bounding box

Model or implementation: Corner predictor (from OSTrack), Frozen

Novel Architectural Elements

Modality-Complementary Prompter (MCP): A side network that interacts with the frozen backbone at multiple stages to generate prompts from auxiliary modalities
Residual Prompt Injection: Learned prompts are added to the RGB tokens of the frozen foundation model (H_l = H_{RGB} + P_{l+1})
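The interplay between the frozen layers and the prompter can be sketched with a toy model. Everything below is a simplified assumption, not the paper's implementation: `frozen_layer` stands in for a frozen encoder block, `prompter` stands in for an MCP block (the real one uses lightweight convolutions and attention), and a single scalar `weight` plays the role of the trainable prompt parameters.

```python
def frozen_layer(tokens, scale):
    """Stand-in for one frozen encoder layer; `scale` is fixed, never trained."""
    return [[scale * v for v in tok] for tok in tokens]


def prompter(features, prev_prompts, weight):
    """Stand-in for an MCP block: fuse current features with the previous prompts.

    `weight` is the only trainable quantity in this toy model.
    """
    return [[weight * (f + p) for f, p in zip(f_tok, p_tok)]
            for f_tok, p_tok in zip(features, prev_prompts)]


def inject(features, prompts):
    """Residual injection: add the prompts to the RGB tokens element-wise."""
    return [[f + p for f, p in zip(f_tok, p_tok)]
            for f_tok, p_tok in zip(features, prompts)]


def forward(rgb_tokens, aux_tokens, layer_scales, prompt_weights):
    """Run the prompted encoder: prompt, inject, then apply the frozen layer."""
    h, p = rgb_tokens, aux_tokens  # the first prompt input is the auxiliary embedding
    for scale, weight in zip(layer_scales, prompt_weights):
        p = prompter(h, p, weight)
        h = frozen_layer(inject(h, p), scale)
    return h


rgb = [[1.0, 2.0]]
aux = [[0.5, 0.5]]
scales = [2.0, 1.0]
rgb_only = forward(rgb, aux, scales, [0.0, 0.0])   # prompts switched off
prompted = forward(rgb, aux, scales, [1.0, 0.0])   # first prompter active
print(rgb_only, prompted)
```

With all prompt weights at zero the output equals that of the frozen backbone alone, which is the sense in which residual injection leaves the pre-trained RGB behaviour intact until the prompts learn something useful.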

Modeling

Base Model: OSTrack (based on ViT-Base backbone)

Training Method: Visual Prompt Tuning (updating only MCP parameters and auxiliary embeddings)

Objective Functions:

Purpose: Classification of target vs. background.

Formally: Weighted focal loss L_cls

Purpose: Bounding box regression.

Formally: L_1 loss + Generalized IoU loss L_iou
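The two regression terms can be illustrated for a single pair of boxes. This is a minimal sketch under assumptions: boxes are given as (x1, y1, x2, y2), the L_1 term is averaged over the four coordinates, and the weights `w_iou` and `w_l1` are hypothetical placeholders set to 1.0 because the summary does not state the loss weights. The focal classification term is not shown.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou(pred, target):
    """Generalized IoU of two (x1, y1, x2, y2) boxes, in [-1, 1]."""
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    union = box_area(pred) + box_area(target) - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest axis-aligned box that encloses both inputs.
    enclose = ((max(pred[2], target[2]) - min(pred[0], target[0]))
               * (max(pred[3], target[3]) - min(pred[1], target[1])))
    if enclose <= 0:
        return iou
    return iou - (enclose - union) / enclose


def l1_loss(pred, target):
    """Mean absolute difference over the four box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def box_loss(pred, target, w_iou=1.0, w_l1=1.0):
    """Weighted sum of the GIoU loss (1 - GIoU) and the L1 loss."""
    return w_iou * (1.0 - giou(pred, target)) + w_l1 * l1_loss(pred, target)


print(box_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

Unlike plain IoU, the GIoU term still provides a useful signal when the predicted and target boxes do not overlap, because the enclosing-box penalty grows with their separation.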

Adaptation: Prompt tuning (<1% parameters updated)

Trainable Parameters: 0.84M trainable parameters out of 93.36M total (approx 0.9%)
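The percentages quoted in this summary follow directly from the parameter counts. The short check below uses only figures stated here (0.84M trainable, 93.36M total, and the 178.6M trainable parameters quoted for full fine-tuning in the Evaluation Highlights); the variable names are illustrative.

```python
# Parameter counts quoted in this summary, in millions.
TRAINABLE_M = 0.84   # prompt parameters updated during training
TOTAL_M = 93.36      # all parameters of the prompted tracker
FFT_M = 178.6        # trainable parameters quoted for full fine-tuning


def percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


share_of_model = percent(TRAINABLE_M, TOTAL_M)
relative_to_fft = percent(TRAINABLE_M, FFT_M)

print(f"trainable share of the model: {share_of_model:.2f}%")
print(f"relative to full fine-tuning: {relative_to_fft:.2f}%")
```

The two ratios use different denominators, which is why the summary cites both "approx 0.9%" (share of the whole model) and "~0.5%" (relative to the full fine-tuning budget).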

Training Data:

RGB-D: DepthTrack train set
RGB-T: LasHeR train set
RGB-E: VisEvent train set

Key Hyperparameters:

global_batch_size: 64
epochs: 60
optimizer: AdamW
weight_decay: 1e-4
initial_learning_rate: 4e-5
lr_scheduler: Decrease by factor of 10 after 48 epochs
initialization: Xavier uniform for prompt parameters
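The learning-rate schedule listed above can be written as a simple step function. This sketch assumes 0-indexed epochs and reads "after 48 epochs" as "from epoch index 48 onward"; the function and constant names are illustrative, not taken from the released code.

```python
def learning_rate(epoch, base_lr=4e-5, drop_epoch=48, factor=0.1):
    """Step schedule: `base_lr` before `drop_epoch`, then `base_lr * factor`.

    `epoch` is 0-indexed, which is an assumption about how the drop is counted.
    """
    return base_lr if epoch < drop_epoch else base_lr * factor


TOTAL_EPOCHS = 60
schedule = [learning_rate(epoch) for epoch in range(TOTAL_EPOCHS)]

low_lr_epochs = sum(1 for lr in schedule if lr < 4e-5)
print(f"epochs at the reduced rate: {low_lr_epochs} of {TOTAL_EPOCHS}")
```

Under this reading, the first 48 epochs run at 4e-5 and the final 12 at 4e-6.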

Compute: Training on 2 NVIDIA Tesla A100 GPUs

Comparison to Prior Work

vs. ProTrack: ViPT generates learnable prompts with a dedicated network (MCP) and trains them, whereas ProTrack uses fixed operations.
vs. DeT/Full Fine-Tuning: ViPT freezes the backbone and adds <1% parameters, whereas DeT fine-tunes the whole model.
vs. VPT: ViPT introduces auxiliary modal inputs into the prompt generation process (Modality-Complementary Prompter), whereas VPT learns static vector prompts.
vs. OSTrack: ViPT extends OSTrack to multi-modal settings via prompting.

Limitations

Currently requires separate training for each multi-modal task (RGB-D, RGB-T, RGB-E) rather than a single unified model
Focuses only on visual prompting; potential for vision-language extension is unexploited
Performance depends on the quality of the pre-trained RGB foundation model

Reproducibility

Code: https://github.com/jiawen-zhu/ViPT

Code and models are publicly available at the repository above. The experiments use standard datasets (DepthTrack, LasHeR, VisEvent), and training is initialized from pre-trained OSTrack weights.

📊 Experiments & Results

Evaluation Setup

Short-term Single Object Tracking on multi-modal benchmarks

Benchmarks:

DepthTrack (RGB-D Tracking)
VOT-RGBD2022 (RGB-D Tracking)
LasHeR (RGB-T Tracking)
RGBT234 (RGB-T Tracking)
VisEvent (RGB-Event Tracking)

Metrics:

F-score (F)
Precision (Pr)
Recall (Re)
Expected Average Overlap (EAO)
Success Rate (SR)
Maximum Success Rate (MSR)
Maximum Precision Rate (MPR)

Statistical methodology: Not explicitly reported in the paper
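For reference, the F-score listed among the metrics is the harmonic mean of precision and recall. The sketch below uses made-up input values purely to show the arithmetic; they are not results from the paper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Illustrative values only, not numbers reported in the paper.
example = f_score(1.0, 0.5)
print(f"F-score for Pr=1.0, Re=0.5: {example:.3f}")
```

Because it is a harmonic mean, the F-score is pulled toward the lower of the two inputs, so a tracker cannot score well by maximizing precision or recall alone.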

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
RGB-D tracking: ViPT outperforms the previous state of the art and the foundation model on major RGB-D benchmarks.
DepthTrack	F-score	0.529	0.594	+0.065
VOT-RGBD2022	EAO	0.676	0.721	+0.045
RGB-T tracking: significant gains in thermal tracking, demonstrating effective modality fusion.
LasHeR	Success Rate (SR)	0.420	0.525	+0.105
RGBT234	MPR	0.823	0.835	+0.012
Comparison with full fine-tuning (FFT), where the baseline is the FFT model: prompt tuning is more accurate despite far fewer trainable parameters.
LasHeR	Success Rate (SR)	0.517	0.525	+0.008
DepthTrack	F-score	0.556	0.594	+0.038

Experiment Figures

Success and Precision plots on LasHeR (RGB-T) dataset.

Visualization of response maps and t-SNE embeddings.

Main Takeaways

Prompt-tuning (ViPT) consistently outperforms Full Fine-Tuning (FFT) across RGB-D, RGB-T, and RGB-E tasks, despite training only about 0.5% as many parameters (0.84M vs. 178.6M).
The Modality-Complementary Prompter (MCP) is crucial; standard VPT (Visual Prompt Tuning) methods without the specific auxiliary fusion design perform significantly worse (e.g., Prompt-deep is lower than ViPT-deep).
Increasing the number of prompt blocks (inserted more frequently in the backbone) correlates with improved performance.
Simply expanding training data or unfreezing more parameters does not necessarily yield better results than the proposed parameter-efficient approach, likely due to data scarcity in downstream tasks.

📚 Prerequisite Knowledge

Prerequisites

Visual Object Tracking (Siamese networks, Transformers)
Vision Transformers (ViT)
Prompt Learning / Parameter-Efficient Fine-Tuning

Key Terms

ViPT: Visual Prompt multi-modal Tracking—the proposed framework

MCP: Modality-Complementary Prompter—a lightweight block inserted into the frozen backbone to generate prompts from auxiliary modalities

RGB-D: Red-Green-Blue + Depth modality

RGB-T: Red-Green-Blue + Thermal modality

RGB-E: Red-Green-Blue + Event modality

Foundation Model: A large-scale pre-trained model (here, an RGB tracker based on ViT) used as the starting point

Prompt-tuning: Freezing the main model and optimizing only a small set of added parameters (prompts) to adapt to a new task

Full fine-tuning: Updating all parameters of a pre-trained model on a downstream dataset

EAO: Expected Average Overlap—a primary metric for VOT challenges measuring both accuracy and robustness

OSTrack: The specific RGB-based foundation tracker used in this paper (One-Stream Transformer Tracking)

Spatial Fovea: An operation within the MCP block that applies a spatial attention mask to focus on salient regions

t-SNE: t-Distributed Stochastic Neighbor Embedding—a technique for dimensionality reduction used to visualize feature clusters