| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Static evaluation on GUI-Critic-Test showing the model's ability to correctly diagnose errors and provide valid suggestions compared to baselines. | ||||
| GUI-Critic-Test | Exact Match (EM) | 86.8 | 91.0 | +4.2 |
| GUI-Critic-Test | Suggestion Validity (SV) | 31.7 | 86.1 | +54.4 |
| Dynamic evaluation on the AndroidWorld benchmark, measuring how much the pre-critic improves a baseline agent's success rate. | ||||
| AndroidWorld | Success Rate | 22.4 | 27.6 | +5.2 |
| Ablation study demonstrating the impact of different training stages and rewards. | ||||
| GUI-Critic-Test | Exact Match (EM) | 88.6 | 91.0 | +2.4 |
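
The Δ column appears to be the plain absolute difference between the "This Paper" and "Baseline" columns, in the metric's own units. The following is a minimal sketch that re-derives it from the rows above as a sanity check. The names `ROWS`, `absolute_delta`, and `check_deltas` are illustrative and do not come from the paper, and rounding to one decimal place is an assumption based on how the table is printed.

```python
# Rows copied from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("GUI-Critic-Test", "Exact Match (EM)", 86.8, 91.0, 4.2),
    ("GUI-Critic-Test", "Suggestion Validity (SV)", 31.7, 86.1, 54.4),
    ("AndroidWorld", "Success Rate", 22.4, 27.6, 5.2),
    ("GUI-Critic-Test", "Exact Match (EM)", 88.6, 91.0, 2.4),  # ablation row
]


def absolute_delta(baseline, ours):
    """Absolute improvement, rounded to one decimal as in the table (assumed)."""
    return round(ours - baseline, 1)


def check_deltas(rows):
    """Return one boolean per row: does the recomputed delta match the reported one?"""
    return [absolute_delta(base, ours) == delta for _, _, base, ours, delta in rows]


if __name__ == "__main__":
    for (bench, metric, base, ours, delta), ok in zip(ROWS, check_deltas(ROWS)):
        status = "ok" if ok else "MISMATCH"
        print(f"{bench:16s} {metric:26s} {absolute_delta(base, ours):+5.1f} "
              f"(reported {delta:+5.1f}) {status}")
```

Under these assumptions, all four reported Δ values agree with the recomputed differences. Note that the ablation row compares against a different baseline (88.6) than the static-evaluation row (86.8), so the two EM deltas are not directly comparable.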