Evaluation Setup
Mobile UI navigation and question answering across seen and unseen applications.
Benchmarks:
- Auto-UI (Page Navigation, Android)
- ScreenQA (Visual Question Answering)
- Self-Navigation (Page Navigation, Mobile3M subset) [New]
Metrics:
- Action Accuracy
- F1* (Improved F1 for OCR/VQA)
- IoU (Intersection over Union)
- Statistical methodology: Not explicitly reported in the paper
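As a rough illustration of the IoU metric listed above, the sketch below computes Intersection over Union for two axis-aligned bounding boxes. The `(x1, y1, x2, y2)` box format and the function name `box_iou` are assumptions made for this example; the summary does not state how the paper defines or aggregates IoU.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0
```

Two 2×2 boxes offset by one unit in each direction overlap in a 1×1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.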
Key Results
Performance on downstream fine-tuning tasks (Stage 3) showing MobileVLM's superiority over general and specialized baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ScreenQA | Accuracy | 51.31 | 65.65 | +14.34 |
| Auto-UI | Action Accuracy | 62.51 | 65.29 | +2.78 |
| Self-Navigation | IoU | 14.31 | 48.49 | +34.18 |

Ablation studies validating the multi-stage training strategy.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Self-Navigation | IoU | 35.89 | 48.49 | +12.60 |
| Auto-UI | Action Accuracy | 60.50 | 65.29 | +4.79 |
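The Δ column is the absolute difference between the "This Paper" and "Baseline" scores. The sketch below recomputes it from the reported numbers and adds a relative gain for comparison. The relative gain is not reported in the source, and the names `delta` and `relative_gain` are chosen here for illustration only.

```python
# (benchmark, metric, baseline, this_paper) as reported in the results above.
RESULTS = [
    ("ScreenQA", "Accuracy", 51.31, 65.65),
    ("Auto-UI", "Action Accuracy", 62.51, 65.29),
    ("Self-Navigation", "IoU", 14.31, 48.49),
    ("Self-Navigation (ablation)", "IoU", 35.89, 48.49),
    ("Auto-UI (ablation)", "Action Accuracy", 60.50, 65.29),
]


def delta(baseline, ours):
    """Absolute improvement in points, rounded to two decimals."""
    return round(ours - baseline, 2)


def relative_gain(baseline, ours):
    """Improvement as a percentage of the baseline score (not in the source)."""
    return round(100.0 * (ours - baseline) / baseline, 1)


if __name__ == "__main__":
    for name, metric, base, ours in RESULTS:
        print(f"{name:28s} {metric:16s} "
              f"delta={delta(base, ours):+.2f} "
              f"relative={relative_gain(base, ours):+.1f}%")
```

Absolute deltas are not directly comparable across metrics with different scales, which is why a relative view can be a useful complement.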
Main Takeaways
- Two-stage pre-training (Intra-UI then Inter-UI) significantly improves downstream navigation performance compared to direct fine-tuning.
- The 'unique page' mechanism in data collection enables graph-based learning, which helps the model capture app logic better than linear interaction traces do.
- MobileVLM generalizes well to unseen apps (as shown in the UnseenAPP benchmarks), substantially outperforming the standard Qwen-VL-Chat baseline.
- Stage 2 pre-training (Action Prediction) is crucial for navigation tasks but has negligible impact on static VQA tasks.
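To make the contrast between linear traces and a page graph concrete, the sketch below merges several interaction traces into one directed graph keyed by unique page identifiers. The trace format, the page and action names, and the function `build_page_graph` are all hypothetical; the summary does not describe the paper's actual data structures or how page uniqueness is determined.

```python
from collections import defaultdict


def build_page_graph(traces):
    """Merge linear traces into a directed graph of unique pages.

    Each trace is a list of (page_id, action) pairs, where `action` is the
    action taken from that page (None on the final page). Pages with the
    same `page_id` are treated as one node, so edges from different traces
    accumulate on the same node. Illustrative only; not the paper's code.
    """
    graph = defaultdict(set)
    for trace in traces:
        # Pair each step with its successor to form (src, action, dst) edges.
        for (src, action), (dst, _) in zip(trace, trace[1:]):
            graph[src].add((action, dst))
    return graph


# Two hypothetical traces that both start on the same "home" page.
traces = [
    [("home", "tap_search"), ("search", None)],
    [("home", "tap_settings"), ("settings", "tap_back"), ("home", None)],
]
graph = build_page_graph(traces)
```

Taken separately, each trace shows a single path out of `home`. Once merged, the `home` node exposes both outgoing actions, which is the kind of structure a linear trace alone would not reveal.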