Evaluation Setup
Evaluation is run on an Android device using the Mobile-Eval benchmark.
Benchmarks:
- Mobile-Eval (Mobile App Navigation) [New]
Metrics:
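- Success (Su): Whether the agent completes the instruction
- Process Score (PS): Accuracy of each step the agent takes
- Relative Efficiency (RE): Agent step count compared with the step count a human needs for the same instruction
- Completion Rate (CR): Share of the human reference steps that the agent completes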
- Statistical methodology: Not explicitly reported in the paper
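The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the `TaskRun` record and its field names are hypothetical, and the normalization of RE as human steps divided by agent steps (so that a human scores 1.0) is an assumption made here for readability.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One agent attempt at one instruction (hypothetical record layout)."""
    succeeded: bool             # did the agent finish the instruction?
    agent_steps: int            # total operations the agent performed
    correct_agent_steps: int    # agent operations judged correct
    human_steps: int            # operations a human needs for the same task
    human_steps_completed: int  # human reference steps the agent also completed


def success(run: TaskRun) -> float:
    """Su: 1.0 if the instruction was completed, else 0.0."""
    return 1.0 if run.succeeded else 0.0


def process_score(run: TaskRun) -> float:
    """PS: fraction of the agent's own steps that were correct."""
    return run.correct_agent_steps / run.agent_steps if run.agent_steps else 0.0


def relative_efficiency(run: TaskRun) -> float:
    """RE: human step count over agent step count (assumed normalization)."""
    return run.human_steps / run.agent_steps if run.agent_steps else 0.0


def completion_rate(run: TaskRun) -> float:
    """CR: fraction of the human reference steps the agent completed."""
    return run.human_steps_completed / run.human_steps if run.human_steps else 0.0


# Illustrative values only (not taken from the paper): the agent takes one
# wrong step, recovers, and still finishes a task a human does in 4 steps.
example = TaskRun(succeeded=True, agent_steps=5, correct_agent_steps=4,
                  human_steps=4, human_steps_completed=4)
```

With these made-up numbers the run scores Su = 1.0 and CR = 1.0 while PS and RE are both 0.8, which is the pattern of a task that succeeds despite an intermediate mistake.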
Key Results
Performance on Mobile-Eval across three difficulty levels (Instruction 1 = simple, 2 = additional requirements, 3 = abstract).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Mobile-Eval (Instruction 1) | Completion Rate (CR) | 1.00 | 0.93 | -0.07 |
| Mobile-Eval (Instruction 2) | Completion Rate (CR) | 1.00 | 0.85 | -0.15 |
| Mobile-Eval (Instruction 3) | Completion Rate (CR) | 1.00 | 0.85 | -0.15 |
| Mobile-Eval (Average) | Relative Efficiency (RE) | 1.00 | 0.80 | -0.20 |
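The Δ column appears to be the reported value minus the baseline. The short sketch below recomputes it from the table values under that assumption; the `rows` list and the `delta` helper are illustrative names, not anything defined in the paper.

```python
# (benchmark, metric, baseline, this paper), copied from the table above.
rows = [
    ("Mobile-Eval (Instruction 1)", "CR", 1.00, 0.93),
    ("Mobile-Eval (Instruction 2)", "CR", 1.00, 0.85),
    ("Mobile-Eval (Instruction 3)", "CR", 1.00, 0.85),
    ("Mobile-Eval (Average)", "RE", 1.00, 0.80),
]


def delta(baseline: float, value: float) -> float:
    """Difference to the baseline, rounded to the table's two decimals."""
    return round(value - baseline, 2)


for benchmark, metric, baseline, value in rows:
    print(f"{benchmark:<30} {metric}  {delta(baseline, value):+.2f}")
```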
Main Takeaways
- Mobile-Agent achieves high completion rates (>80%) even on abstract or multi-app tasks, which supports the vision-only approach.
- The Process Score (PS) is often lower than the Success rate (Su), indicating that the agent makes mistakes along the way but uses self-reflection to correct them and still finish the task.
- Mobile-Agent handles cross-app workflows (e.g., TikTok to Maps) and multilingual apps (Chinese) despite GPT-4V limitations.