MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

📝 Paper Summary

Medical Benchmarking, Multimodal Reasoning, Clinical Decision Making

MedXpertQA introduces a challenging medical benchmark constructed from specialty board exams and diverse clinical images to evaluate expert-level reasoning capabilities where current models like GPT-4o struggle.

Core Problem

Existing medical benchmarks are either too easy (saturated by current models) or lack clinical realism, relying on general licensing exams or simple caption-based multimodal questions rather than complex diagnostic tasks.

Why it matters:

Current benchmarks like MedQA and MMLU-Medical are saturated, with models like o1 achieving ~99% accuracy, making it impossible to distinguish true expert reasoning from memorization
Traditional multimodal benchmarks use simple QA pairs generated from captions, failing to simulate real-world clinical workflows where doctors must synthesize patient history, vitals, and imaging

Concrete Example: A traditional benchmark might ask 'What organ is this?' given an MRI. MedXpertQA presents a 27-year-old patient's history of migraines and hypothyroidism, vital signs showing hypertension, and a lab report, then asks for the best next treatment step (e.g., 'Benztropine') from a set of 5-10 answer options in which every incorrect option is a plausible distractor.

Key Novelty

MedXpertQA (Benchmark for Expert-Level Medical Reasoning)

Incorporates questions from 17 specific specialty board exams (e.g., Family Medicine, Addiction Medicine) rather than just general licensing exams, increasing difficulty and domain specificity
Implements a rigorous filtering pipeline using 'AI Experts' (models) and 'Human Experts' (using adaptive Brier score thresholds) to ensure only non-trivial questions remain
Utilizes data synthesis (question rewriting and option augmentation) to minimize data leakage risks while preserving clinical accuracy through licensed physician review

Architecture

The construction pipeline of MedXpertQA, detailing data sources, filtering steps, and the final dataset composition.

Evaluation Highlights

o1 achieves 49.89% average accuracy, significantly outperforming GPT-4o (35.96%) and pre-licensed human experts (43.92%), yet remains below 50%, indicating high difficulty
DeepSeek-R1 scores 37.76% on the Text subset, outperforming GPT-4o (30.37%) and demonstrating the value of inference-time scaling for medical reasoning
Multimodal performance gap: GPT-4o (42.80%) outperforms Qwen2.5-VL-72B (29.95%) by a large margin on the MM subset, highlighting proprietary model dominance in visual medical tasks

Breakthrough Assessment

8/10

Significantly raises the bar for medical AI by addressing saturation in existing benchmarks. The inclusion of specialty boards and rigorous leakage prevention makes it a robust standard for next-gen reasoning models.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot Multiple-Choice Question Answering (Text and Multimodal)

Inputs: Clinical case description (text), optional medical images (radiology, pathology, etc.), and a set of 5-10 answer options

Outputs: The index of the correct medical diagnosis or treatment plan
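A minimal sketch of how one item in this setting could be represented and rendered as a zero-shot prompt. The class name `MCQItem`, its field names, and the prompt layout are assumptions made for illustration; they are not the schema of the released dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MCQItem:
    """One benchmark item (hypothetical field names, not the released schema)."""
    question: str                 # clinical case description
    options: List[str]            # 5-10 answer options
    answer_index: int             # index of the correct option
    image_paths: List[str] = field(default_factory=list)  # empty for text-only items

    def __post_init__(self) -> None:
        # Enforce the option count and index range described in the problem definition.
        if not 5 <= len(self.options) <= 10:
            raise ValueError("expected between 5 and 10 options")
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index out of range")

    def render_prompt(self) -> str:
        """Format the case followed by lettered options, one per line."""
        letters = "ABCDEFGHIJ"
        lines = [self.question, ""]
        lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(self.options)]
        return "\n".join(lines)
```

The validation in `__post_init__` only mirrors the 5-10 option range stated above; a real loader would follow whatever format the repository actually uses.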

Pipeline Flow

Data Collection (Exam Sources)
AI Expert Filtering
Human Expert Filtering
Similarity Filtering
Data Synthesis & Augmentation
Expert Review
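Of the steps listed, Similarity Filtering is the one most easily shown in code. The sketch below assumes each question has already been embedded as a vector (for example by an encoder such as MedCPT) and greedily drops any question whose cosine similarity to an already kept question reaches a threshold. The function names, the greedy strategy, and the 0.9 threshold are illustrative assumptions, not the paper's stated procedure.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_near_duplicates(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical cutoff, chosen only for illustration
) -> List[int]:
    """Return indices of items kept after greedy near-duplicate removal."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        # Keep item i only if it is dissimilar to every item kept so far.
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

The greedy pass is quadratic in the number of questions, which is acceptable for a sketch; a larger collection would normally use an approximate nearest-neighbour index instead.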

System Modules

Data Collection

Source questions from USMLE, COMLEX, 17 Specialty Boards, and NEJM Image Challenge

Model or implementation: N/A

AI Expert Filtering (Filtering)

Remove questions that are too easy for current AI models

Model or implementation: Ensemble of 8 models (Basic and Advanced)
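One way to picture this step, as a sketch only: collect each model's predicted option index per question and discard questions that too many models answer correctly. The function name, the input layout, and the `max_correct_fraction` rule are assumptions; the summary does not state the exact voting rule applied to the 8-model ensemble.

```python
from typing import Dict, List


def filter_easy_questions(
    predictions: Dict[str, List[int]],  # model name -> predicted option index per question
    gold: List[int],                    # correct option index per question
    max_correct_fraction: float = 0.5,  # hypothetical cutoff, not from the paper
) -> List[int]:
    """Return indices of questions that at most the given fraction of models solve."""
    n_models = len(predictions)
    if n_models == 0:
        raise ValueError("at least one model is required")
    kept: List[int] = []
    for q, answer in enumerate(gold):
        n_correct = sum(1 for preds in predictions.values() if preds[q] == answer)
        if n_correct / n_models <= max_correct_fraction:
            kept.append(q)
    return kept
```

Setting `max_correct_fraction` to 0.0 would keep only questions that no model in the ensemble answers correctly, which is the strictest version of the same idea.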

Human Expert Filtering (Filtering)

Select questions that are challenging for humans based on response statistics

Model or implementation: Adaptive Brier Score Thresholding
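The following sketch shows how a Brier score can be computed from the distribution of human responses over a question's options, and how a data-dependent threshold could then select the harder questions. The multi-class Brier score itself is standard; the quantile-based threshold is only a stand-in, since the paper's exact adaptive rule is not reproduced in this summary, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def brier_score(response_dist: Sequence[float], correct_index: int) -> float:
    """Multi-class Brier score: squared distance between the human response
    distribution and the one-hot vector of the correct option (0 = all correct)."""
    return sum(
        (p - (1.0 if i == correct_index else 0.0)) ** 2
        for i, p in enumerate(response_dist)
    )


def quantile_threshold(scores: Sequence[float], quantile: float = 0.5) -> float:
    """Assumed adaptive rule: take the score at the given quantile of the pool."""
    ordered = sorted(scores)
    idx = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[idx]


def select_hard_questions(
    items: Sequence[Tuple[Sequence[float], int]],  # (response distribution, correct index)
    quantile: float = 0.5,
) -> List[int]:
    """Return indices of questions whose Brier score reaches the pool threshold."""
    scores = [brier_score(dist, correct) for dist, correct in items]
    threshold = quantile_threshold(scores, quantile)
    return [i for i, s in enumerate(scores) if s >= threshold]
```

Because the threshold is computed from the pool it is applied to, the same code yields different cutoffs for sources with different difficulty profiles, which is the sense in which it is "adaptive" here.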

Data Synthesis

Rewrite questions and augment options to prevent leakage and increase difficulty

Model or implementation: GPT-4o / Claude-3.5-Sonnet
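The bookkeeping side of option augmentation can be sketched without calling any model: given extra distractors (which in the pipeline above would come from an LLM and be reviewed by physicians), merge them with the original options, shuffle, and track where the correct answer lands. The function name, the `target_size` default, and the seeded shuffle are illustrative assumptions, and the sketch assumes option strings are unique.

```python
import random
from typing import List, Sequence, Tuple


def augment_options(
    options: Sequence[str],
    answer_index: int,
    new_distractors: Sequence[str],   # candidate distractors, e.g. LLM-generated
    target_size: int = 10,            # hypothetical target option count
    seed: int = 0,                    # fixed seed so the shuffle is reproducible
) -> Tuple[List[str], int]:
    """Return the augmented, shuffled option list and the new correct index."""
    correct = options[answer_index]
    pool = list(options)
    for candidate in new_distractors:
        if len(pool) >= target_size:
            break
        if candidate not in pool:     # skip duplicates of existing options
            pool.append(candidate)
    rng = random.Random(seed)
    rng.shuffle(pool)
    return pool, pool.index(correct)
```

Shuffling after augmentation avoids the new distractors always appearing at the end of the list, which a model could otherwise exploit as a positional cue.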

Novel Architectural Elements

Integration of specialty board exams (17 specialties) extending beyond general licensing scope
Hierarchical filtering using both AI consensus and human Brier score distributions

Modeling

Base Model: Evaluates 18 models including o1, GPT-4o, Gemini, Claude, Qwen, and DeepSeek variants

Compute: Not reported in the paper

Comparison to Prior Work

vs. MedQA: MedXpertQA includes 17 specialty boards and multimodal data, whereas MedQA is limited to general licensing text questions
vs. OmniMedVQA: MedXpertQA uses real clinical exam questions with expert annotations, while OmniMedVQA relies on captions and automated generation
vs. MMMU: MedXpertQA focuses exclusively on expert-level clinical tasks with real-world noise (e.g., patient history), whereas MMMU is broader and academic

Limitations

Dataset primarily reflects United States medical standards (USMLE/American Boards), potentially limiting global applicability
Cost constraints restricted o1 and o3-mini evaluation to a 10% sampled subset rather than the full benchmark
Multimodal questions are limited to 5 options due to image dependency, unlike the 10 options for text questions, complicating direct comparison between subsets
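To make the last limitation concrete: random guessing yields 10% accuracy with 10 options but 20% with 5, so raw accuracies on the two subsets sit on different scales. One common way to put them on a comparable footing is to rescale accuracy so that chance maps to 0 and a perfect score to 1. This correction is not reported in the paper; the sketch below is an illustration applied to the GPT-4o figures quoted in this summary, assuming every Text question has 10 options and every MM question has 5.

```python
def chance_corrected(accuracy: float, n_options: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and perfect accuracy to 1.0."""
    if n_options < 2:
        raise ValueError("need at least two options")
    chance = 1.0 / n_options
    return (accuracy - chance) / (1.0 - chance)


# GPT-4o accuracies quoted in this summary (Text: 30.37%, MM: 42.80%).
text_score = chance_corrected(0.3037, n_options=10)
mm_score = chance_corrected(0.4280, n_options=5)
print(f"Text: {text_score:.3f}, MM: {mm_score:.3f}")
```

Under these assumptions the gap between the two subsets narrows from roughly 12 points in raw accuracy to roughly 6 points after correction (about 0.226 versus 0.285), which shows how much of the apparent difference can come from the option count alone.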

Reproducibility

Code: https://github.com/TsinghuaC3I/MedXpertQA

The benchmark dataset and evaluation code are publicly available at the repository above. Proprietary models (o1, GPT-4o) used for synthesis and evaluation are closed-source.

📊 Experiments & Results

Evaluation Setup

Zero-shot Chain-of-Thought (CoT) evaluation on Text and Multimodal subsets

Benchmarks:

MedXpertQA Text (Specialty-level Medical QA) [New]
MedXpertQA MM (Multimodal Clinical Diagnosis) [New]

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper
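Scoring zero-shot CoT outputs needs two steps: extracting the chosen option letter from free-form text, then computing accuracy. The sketch below assumes the model states its choice in a phrase such as "answer is (C)" or "Answer: C"; the regular expression, the function names, and the rule that unparseable outputs count as wrong are assumptions, not the released evaluation code.

```python
import re
from typing import List, Optional

# Matches "answer is (C)", "Answer: C", "answer C", etc. The letter itself is
# matched case-sensitively so that "the answer is a ..." is not read as option A.
_ANSWER_RE = re.compile(r"(?i:answer)(?:\s+is)?\s*:?\s*\(?([A-J])\)?(?![A-Za-z])")


def extract_choice(output: str) -> Optional[str]:
    """Return the last option letter announced in the output, or None if absent."""
    matches = _ANSWER_RE.findall(output)
    return matches[-1] if matches else None


def accuracy(predicted: List[Optional[str]], gold: List[str]) -> float:
    """Fraction of items answered correctly; a None prediction counts as wrong."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Taking the last match rather than the first lets the model revise its choice during the chain of thought, with the final stated answer being the one that is scored.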

Key Results

Benchmark	Metric	Baseline	Compared Model	Δ
Performance of Inference-Time Scaled Models vs. Vanilla LMMs and Humans. o1 demonstrates superior performance but significant room for improvement remains.
MedXpertQA (Full)	Accuracy (%)	43.92 (pre-licensed human experts)	49.89 (o1)	+5.97
MedXpertQA (Full)	Accuracy (%)	35.96 (GPT-4o)	49.89 (o1)	+13.93
Text-Only Evaluation showing the impact of reasoning-focused models on complex clinical text.
MedXpertQA Text	Accuracy (%)	30.37 (GPT-4o)	37.76 (DeepSeek-R1)	+7.39
Multimodal Evaluation highlighting the gap between proprietary and open-source models.
MedXpertQA MM	Accuracy (%)	29.95 (Qwen2.5-VL-72B)	42.80 (GPT-4o)	+12.85

Experiment Figures

A bar chart comparing model performance on MedXpertQA Text versus other popular medical benchmarks (MedQA, MMLU).

Comparison of performance between vanilla models and their inference-time scaled counterparts (e.g., DeepSeek-V3 vs R1).

Main Takeaways

Medical specialty questions in MedXpertQA are significantly harder than general licensing exams, dropping state-of-the-art model performance from ~90% (MedQA) to <50%.
Inference-time scaling (o1, DeepSeek-R1) provides substantial gains in medical reasoning, consistently outperforming vanilla models like GPT-4o and Qwen2.5-72B.
A significant 'Reasoning' gap exists: models consistently score lower on the 'Reasoning' subset compared to 'Understanding' questions, validating the benchmark's categorization.
Visual perception remains a bottleneck; even strong text reasoners struggle when critical diagnostic information is locked within complex medical imagery.

📚 Prerequisite Knowledge

Prerequisites

Understanding of medical licensing vs. specialty board exams
Familiarity with multiple-choice QA evaluation metrics
Basic knowledge of large multimodal model architectures

Key Terms

USMLE: United States Medical Licensing Examination—a general multi-step examination for medical licensure in the U.S.

COMLEX-USA: Comprehensive Osteopathic Medical Licensing Examination—a licensure exam series for osteopathic physicians, distinct from USMLE

Brier score: A proper scoring rule that measures the accuracy of probabilistic predictions as the squared difference between predicted probabilities and observed outcomes; used here to quantify question difficulty based on human response distributions

inference-time scaling: Techniques that increase computational effort during the generation phase (e.g., generating internal chain-of-thought) to improve reasoning performance

data leakage: The issue where test data is inadvertently included in a model's training set, leading to artificially inflated performance scores

MedCPT: A medical-specific contrastive pre-trained transformer model used for generating embeddings to measure semantic similarity between questions

distractors: Incorrect options in a multiple-choice question designed to test precision and understanding