| Benchmark | Metric | Baseline | This Paper | Δ (This Paper - Baseline) |
|---|---|---|---|---|
| Validation of the evaluation methodology itself, specifically the effectiveness of GPT-4 as an answer extractor compared to human judgment. | | | | |
| MMBench (Subset) | Alignment with Human Assessment | 100 | 91.5 | -8.5 |
| Evaluation of model instruction-following capabilities using heuristic matching rates. | | | | |
| MMBench | Heuristic Matching Success Rate | 100 | Not reported in the paper | Not reported in the paper |
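
The "Alignment with Human Assessment" metric and the Δ column can be read as simple arithmetic: the share of items on which the automatic extractor picks the same option as a human annotator, and the difference between that share and the baseline. The following is a minimal sketch under that assumption. The function names and the toy labels are hypothetical and are not taken from the paper, which may define the metric differently.

```python
def agreement_rate(extracted, human):
    """Return the percentage of items where the extracted option
    matches the human-assigned option (assumed definition)."""
    if len(extracted) != len(human):
        raise ValueError("label lists must have the same length")
    if not extracted:
        raise ValueError("label lists must not be empty")
    matches = sum(e == h for e, h in zip(extracted, human))
    return 100.0 * matches / len(extracted)


def delta(this_paper, baseline):
    """Difference reported in the Δ column: this paper minus baseline."""
    return this_paper - baseline


# Toy data (hypothetical): option letters chosen by the extractor vs. a human.
extracted_labels = ["A", "B", "C", "D"]
human_labels = ["A", "B", "C", "A"]

rate = agreement_rate(extracted_labels, human_labels)
print(rate)                 # agreement on the toy data, in percent
print(delta(rate, 100.0))   # gap to a perfect-agreement baseline
```

With the values in the table, the same subtraction gives 91.5 - 100 = -8.5.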