Evaluation Setup
The evaluation fine-tunes pre-trained Swin Transformers on a range of downstream visual recognition tasks.
Benchmarks:
- MS COCO (Instance Segmentation)
- ADE20K (Semantic Segmentation)
- Pascal VOC 0712 (Object Detection)
- DOTA / STAR (Oriented Object Detection)
- Oxford 102 Flower / Oxford-IIIT Pet / VOC 2007 (Image Classification)
Metrics:
- APbox (Bounding Box Average Precision)
- APmask (Mask Average Precision)
- mIoU (Mean Intersection over Union)
- Top-1 Accuracy
- Statistical methodology: Not explicitly reported in the paper
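As an illustration of the mIoU metric listed above, here is a minimal sketch in plain Python. The helper name `mean_iou` and the toy label lists are assumptions made for this example and do not come from the paper. Benchmark toolkits usually accumulate intersections and unions over the whole dataset before averaging across classes, so treat this as a per-sample simplification.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes present in pred or target.

    pred and target are flat sequences of integer class labels, one per pixel.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with 3 classes and 6 "pixels" (illustrative values only).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(f"mIoU = {mean_iou(pred, target, 3):.3f}")
```

Per-class IoUs here are 1/3, 2/3, and 1/2, so the mean is 0.5. APbox and APmask follow a different recipe (precision averaged over recall levels and IoU thresholds) and are not covered by this sketch.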
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| COCO Instance Segmentation results demonstrate Mona's ability to surpass full fine-tuning on a highly competitive dense prediction benchmark. | | | | |
| MS COCO | APbox | 47.2 | 48.2 | +1.0 |
| MS COCO | APmask | 40.9 | 41.8 | +0.9 |
| Object Detection and Semantic Segmentation results showing consistent superiority over full fine-tuning and other PEFT methods. | | | | |
| Pascal VOC 0712 | APbox | 82.5 | 86.1 | +3.6 |
| ADE20K | mIoU | 50.15 | 50.33 | +0.18 |
| Oriented Object Detection results on remote sensing datasets. | | | | |
| DOTA | APbox | 73.23 | 73.57 | +0.34 |
| STAR | APbox | 29.9 | 31.2 | +1.3 |
| Image Classification results showing Mona also performs well on simpler tasks. | | | | |
| Flowers102 | Top-1 Acc | 97.40 | 99.49 | +2.09 |
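The Δ column is the "This Paper" score minus the "Baseline" score. The short sketch below recomputes it from the values in the table and adds a relative gain for context. The variable names and the relative-gain column are additions made for this example and are not reported in the summary.

```python
# (benchmark, metric, baseline, this_paper) rows copied from the table above.
results = [
    ("MS COCO", "APbox", 47.2, 48.2),
    ("MS COCO", "APmask", 40.9, 41.8),
    ("Pascal VOC 0712", "APbox", 82.5, 86.1),
    ("ADE20K", "mIoU", 50.15, 50.33),
    ("DOTA", "APbox", 73.23, 73.57),
    ("STAR", "APbox", 29.9, 31.2),
    ("Flowers102", "Top-1 Acc", 97.40, 99.49),
]

# Absolute improvement, rounded to the table's two-decimal precision.
deltas = [round(ours - base, 2) for _, _, base, ours in results]

for (name, metric, base, ours), delta in zip(results, deltas):
    relative = 100.0 * (ours - base) / base  # gain as a percentage of baseline
    print(f"{name:16s} {metric:10s} delta={delta:+.2f}  relative={relative:+.2f}%")
```

The recomputed deltas match the Δ column for all seven rows.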
Main Takeaways
- Mona consistently outperforms full fine-tuning across diverse visual tasks (detection, segmentation, classification), challenging the assumption that full fine-tuning is the performance upper bound.
- The 'multi-cognitive' design (multi-scale convolutions) is crucial for dense prediction tasks, where spatial context matters more than in classification or NLP.
- These gains come while updating far fewer backbone parameters than full fine-tuning (typically <10% of the backbone's parameters).
- Unlike LoRA, which struggles to match full fine-tuning on complex vision tasks (COCO, ADE20K), Mona surpasses it, suggesting that adapter architecture (convolutional versus linear) is key for visual PEFT.
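To give a feel for why a multi-scale convolutional adapter can stay well under 10% of the backbone's parameters, the sketch below counts parameters for a hypothetical adapter. Every number and layer choice is an assumption made for illustration and is not taken from the paper: a down-projection, three depthwise convolutions with 3x3, 5x5, and 7x7 kernels, an up-projection, a model width of 1024, a bottleneck width of 64, two adapters per block, and the common approximation of about 12·d² weights per transformer block.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of a dense layer mapping d_in to d_out features."""
    return d_in * d_out + (d_out if bias else 0)


def depthwise_conv_params(channels, kernel, bias=True):
    """Parameter count of a depthwise 2D convolution (one filter per channel)."""
    return channels * kernel * kernel + (channels if bias else 0)


def adapter_params(d_model, bottleneck, kernels=(3, 5, 7)):
    """Hypothetical adapter: down-projection, multi-scale depthwise convs, up-projection."""
    down = linear_params(d_model, bottleneck)
    convs = sum(depthwise_conv_params(bottleneck, k) for k in kernels)
    up = linear_params(bottleneck, d_model)
    return down + convs + up


# Assumed sizes, chosen only for illustration.
d_model, bottleneck, adapters_per_block = 1024, 64, 2
block_params = 12 * d_model ** 2  # rough attention + MLP weight count per block
added = adapters_per_block * adapter_params(d_model, bottleneck)
fraction = added / block_params
print(f"adapter params per block: {added}, about {100 * fraction:.1f}% of the block")
```

Under these assumptions the adapters add roughly 2% of a block's weights. Most of that comes from the two projections, and the multi-scale depthwise convolutions contribute only a few thousand parameters.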