| Benchmark | Metric | Baseline | This Paper | Δ (points) |
|---|---|---|---|---|
| Performance of Inference-Time Scaled Models vs. Vanilla LMMs and Humans. o1 demonstrates superior performance, but significant room for improvement remains. | ||||
| MedXpertQA (Full) | Accuracy | 43.92 | 49.89 | +5.97 |
| MedXpertQA (Full) | Accuracy | 35.96 | 49.89 | +13.93 |
| Text-Only Evaluation showing the impact of reasoning-focused models on complex clinical text. | ||||
| MedXpertQA Text | Accuracy | 30.37 | 37.76 | +7.39 |
| Multimodal Evaluation highlighting the gap between proprietary and open-source models. | ||||
| MedXpertQA MM | Accuracy | 29.95 | 42.80 | +12.85 |
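
The Δ column can be cross-checked against the other two numeric columns. The sketch below assumes Δ is simply the "This Paper" accuracy minus the baseline accuracy, in accuracy points; the names `ROWS`, `delta`, and `check_rows` are illustrative and not from the source. `Decimal` is used so that two-decimal values compare exactly rather than through binary floating point.

```python
from decimal import Decimal

# (benchmark, baseline, this paper, reported delta), copied from the table above.
ROWS = [
    ("MedXpertQA (Full)", "43.92", "49.89", "+5.97"),
    ("MedXpertQA (Full)", "35.96", "49.89", "+13.93"),
    ("MedXpertQA Text", "30.37", "37.76", "+7.39"),
    ("MedXpertQA MM", "29.95", "42.80", "+12.85"),
]


def delta(baseline: str, reported: str) -> Decimal:
    """Accuracy gain over the baseline (assumed definition of the delta column)."""
    return Decimal(reported) - Decimal(baseline)


def check_rows(rows):
    """Return (benchmark, recomputed delta, matches reported delta) for each row."""
    results = []
    for name, baseline, reported, stated in rows:
        recomputed = delta(baseline, reported)
        results.append((name, recomputed, recomputed == Decimal(stated)))
    return results


if __name__ == "__main__":
    for name, recomputed, ok in check_rows(ROWS):
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: {recomputed:+} ({status})")
```

Under that assumption, every recomputed difference matches the value reported in the table.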