
Capabilities of GPT-5 on Multimodal Medical Reasoning

Shansong Wang, Mingzhe Hu, Qiang Li, M. Safari, Xiaofeng Yang
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine
arXiv.org (2025)
MM Reasoning Benchmark QA

📝 Paper Summary

Medical Multi-modal Learning, Clinical Decision Support, Visual Question Answering (VQA)
This study benchmarks GPT-5 on diverse medical exams, showing it surpasses GPT-4o and human experts in multimodal diagnostic reasoning by integrating text, imaging, and structured data.
Core Problem
Current medical AI models struggle to consistently integrate heterogeneous evidence (patient history, lab results, medical images) without extensive domain-specific fine-tuning, often performing below human experts.
Why it matters:
  • Real-world clinical decision-making requires synthesizing diverse modalities (text, CT scans, vitals) at the same time, rather than processing text alone
  • Previous models such as GPT-4o remain below human-expert performance on complex multimodal reasoning tasks, which limits their reliability for high-stakes clinical support
  • Existing evaluations often lack a unified protocol, making it difficult to separate gains that come from the model itself from gains that come from prompt engineering
Concrete Example: In a case of esophageal perforation, a model must correlate 'suprasternal crepitus' (text) with specific CT findings (image) to diagnose Boerhaave syndrome. Weaker models might miss the visual cue or fail to link it to the clinical history, leading to incorrect treatment recommendations.
Key Novelty
Unified Multimodal Medical Benchmarking of GPT-5
  • First controlled, longitudinal evaluation of GPT-5 against GPT-4o and human experts using identical zero-shot Chain-of-Thought (CoT) prompting across text and multimodal tasks
  • Demonstrates a qualitative shift from 'human-comparable' (GPT-4) to 'super-human' performance in integrating visual and textual clinical evidence
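The idea of holding the prompt fixed across models can be sketched in a few lines. This is a minimal illustration, not the paper's harness: the prompt wording, the `build_prompt`, `extract_answer`, and `accuracy` helpers, the toy items, and the stub model are all hypothetical, and image inputs for the multimodal tasks are omitted.

```python
import re
from typing import Callable, Dict, List

# Hypothetical zero-shot CoT instruction; the paper's exact wording is not given here.
COT_SUFFIX = "Let's think step by step, then finish with 'Answer: <letter>'."


def build_prompt(question: str, options: Dict[str, str]) -> str:
    """Build the same prompt for every model so only the model varies."""
    lines = [question] + [f"{key}. {text}" for key, text in sorted(options.items())]
    lines.append(COT_SUFFIX)
    return "\n".join(lines)


def extract_answer(completion: str) -> str:
    """Return the last 'Answer: X' letter in the completion, or '' if none is found."""
    matches = re.findall(r"Answer:\s*\(?([A-J])\)?", completion)
    return matches[-1] if matches else ""


def accuracy(model: Callable[[str], str], items: List[dict]) -> float:
    """Percentage of items where the extracted letter equals the gold label."""
    correct = 0
    for item in items:
        prompt = build_prompt(item["question"], item["options"])
        if extract_answer(model(prompt)) == item["label"]:
            correct += 1
    return 100.0 * correct / len(items)


# Placeholder items and a stub model, used only to exercise the functions.
TOY_ITEMS = [
    {"question": "Toy question 1", "options": {"A": "x", "B": "y"}, "label": "A"},
    {"question": "Toy question 2", "options": {"A": "x", "B": "y"}, "label": "B"},
]


def always_a(prompt: str) -> str:
    return "Some reasoning steps. Answer: A"


if __name__ == "__main__":
    print(accuracy(always_a, TOY_ITEMS))
```

Because each model is passed in as a plain callable, any two models can be scored on the same items with the same prompt, which is the property a unified protocol is meant to guarantee.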
Evaluation Highlights
  • +29.26 percentage-point gain in reasoning accuracy on MedXpertQA MM (multimodal) over GPT-4o-2024-11-20
  • Surpasses pre-licensed human experts on MedXpertQA MM by 24.23 percentage points in multimodal reasoning and 29.40 percentage points in understanding
  • Achieves 95.84% on MedQA (USMLE-style), 4.80 percentage points above GPT-4o, indicating stronger clinical fact recall
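Reading the reported gains as absolute percentage-point differences (an assumption about how the figures are stated), the MedQA numbers above imply a GPT-4o baseline of 95.84 − 4.80 = 91.04. The short sketch below does that arithmetic and contrasts the absolute gain with the corresponding relative gain; the function names are illustrative only.

```python
def baseline_from_gain(score: float, gain_pp: float) -> float:
    """Baseline accuracy implied by a score and an absolute gain in percentage points."""
    return round(score - gain_pp, 2)


def relative_gain(score: float, gain_pp: float) -> float:
    """The same gain expressed as a percentage of the implied baseline."""
    baseline = score - gain_pp
    return round(100.0 * gain_pp / baseline, 2)


if __name__ == "__main__":
    # MedQA figures from the highlights above, read as percentage points.
    print(baseline_from_gain(95.84, 4.80))  # 91.04
    print(relative_gain(95.84, 4.80))
```

The two readings differ noticeably, which is why it helps to say whether a quoted "+x%" is absolute or relative.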
Breakthrough Assessment
9/10
Shows a massive leap (roughly +26 to +29 percentage points) in multimodal reasoning over the previous SOTA (GPT-4o) and convincingly beats human expert baselines, marking a potential paradigm shift for AI in medicine.