Capabilities of GPT-5 on Multimodal Medical Reasoning

📝 Paper Summary

Medical Multi-modal Learning Clinical Decision Support Visual Question Answering (VQA)

This study benchmarks GPT-5 on diverse medical exams, showing it surpasses GPT-4o and human experts in multimodal diagnostic reasoning by integrating text, imaging, and structured data.

Core Problem

Current medical AI models struggle to consistently integrate heterogeneous evidence (patient history, lab results, medical images) without extensive domain-specific fine-tuning, often performing below human experts.

Why it matters:

Real-world clinical decision-making requires synthesizing diverse modalities (text, CT scans, vitals) simultaneously, not just processing text alone
Previous models like GPT-4o remain below human-expert performance benchmarks in complex multimodal reasoning tasks, limiting their reliability for high-stakes clinical support
Existing evaluations often lack a unified protocol, making it difficult to isolate model architectural improvements from prompt engineering gains

Concrete Example: In a case of esophageal perforation, a model must correlate 'suprasternal crepitus' (text) with specific CT findings (image) to diagnose Boerhaave syndrome. Weaker models might miss the visual cue or fail to link it to the clinical history, leading to incorrect treatment recommendations.

Key Novelty

Unified Multimodal Medical Benchmarking of GPT-5

First controlled, longitudinal evaluation of GPT-5 against GPT-4o and human experts using identical zero-shot Chain-of-Thought (CoT) prompting across text and multimodal tasks
Demonstrates a qualitative shift from 'human-comparable' (GPT-4) to 'super-human' performance in integrating visual and textual clinical evidence

Evaluation Highlights

+29.26% improvement in reasoning accuracy on MedXpertQA MM (multimodal) compared to GPT-4o-2024-11-20
Surpasses pre-licensed human experts by +24.23% in multimodal reasoning and +29.40% in understanding on MedXpertQA MM
Achieves 95.84% on MedQA (USMLE-style), exceeding GPT-4o by 4.80% and demonstrating significantly improved clinical fact recall

Breakthrough Assessment

9/10

Shows a massive leap (+26-29%) in multimodal reasoning over the previous SOTA (GPT-4o) and convincingly beats human expert baselines, marking a potential paradigm shift for AI in medicine.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot Multimodal Question Answering (QA) and Visual Question Answering (VQA) in medical domains

Inputs: Medical question Q, optional context C (clinical vignette), and optional set of medical images I

Outputs: Chain-of-Thought rationale R followed by a final answer choice A (from multiple-choice options)

Pipeline Flow

Input formatting (System Prompt + Question + Images)
Step 1: Rationale Generation (CoT)
Step 2: Convergence to Final Answer

System Modules

Input Formatter (Inference Protocol)

Constructs the chat history with system message and user query containing text and images

Model or implementation: Procedural script

Reasoning Engine (Inference Protocol)

Generates free-form step-by-step diagnostic reasoning

Model or implementation: GPT-5 (or variants)

Answer Converger (Inference Protocol)

Forces the model to select a specific multiple-choice option based on previous reasoning

Model or implementation: GPT-5 (or variants)

Novel Architectural Elements

Unified multimodal CoT protocol: Images are injected only in the first turn, allowing the model to reason over visual and textual evidence simultaneously in the rationale generation phase before committing to an answer

Modeling

Base Model: GPT-5 (proprietary OpenAI model), compared against GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20

Training Method: Zero-shot inference evaluation only

Adaptation: None (Zero-shot)

Trainable Parameters: None (Inference only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. GPT-4o: GPT-5 shows significantly stronger cross-modal reasoning (+29% on MedXpertQA MM)
vs. Med-PaLM 2 [not cited in paper]: GPT-5 achieves these results zero-shot without domain-specific fine-tuning, whereas Med-PaLM 2 relies on extensive medical instruction tuning

Limitations

Evaluations performed in idealized, standardized testing environments, not real clinical workflows
VQA-RAD performance of GPT-5 was slightly lower than GPT-5-mini, suggesting potential calibration issues on smaller/specialized datasets
Study relies on proprietary models (GPT-5) with opaque architectures and training data
No analysis of cost or latency for real-time clinical deployment

Reproducibility

Code: https://github.com/XYang-Lab/GPT-5-Evaluation

Available: Evaluation code and prompt templates at https://github.com/XYang-Lab/GPT-5-Evaluation. Missing: GPT-5 model weights (proprietary API access required). The exact underlying architecture of GPT-5 is not disclosed by the authors or OpenAI.

📊 Experiments & Results

Evaluation Setup

Zero-shot Chain-of-Thought (CoT) Question Answering across text and multimodal medical datasets

Benchmarks:

MedQA (Medical Licensing QA (Text))
MedXpertQA (Expert-level Medical QA (Text & Multimodal))
MMLU-Medical (General Medical Knowledge QA (Text))
USMLE Self Assessment (Clinical Licensing Exam (Text))
VQA-RAD (Radiology VQA (Multimodal))

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GPT-5 demonstrates consistent improvements over GPT-4o across text-based medical benchmarks, with notable gains in complex reasoning tasks.
MedQA (US 4-option)	Accuracy	91.04	95.84	+4.80
MedXpertQA Text (Reasoning)	Accuracy	30.63	56.96	+26.33
USMLE Step 2	Accuracy	93.33	97.50	+4.17
Multimodal evaluation reveals the largest performance leaps, with GPT-5 significantly outperforming baselines on complex image-text integration tasks.
MedXpertQA MM (Reasoning)	Accuracy	40.73	69.99	+29.26
MedXpertQA MM (Understanding)	Accuracy	48.19	74.37	+26.18
Comparison against human experts highlights a shift from sub-human to super-human performance.
MedXpertQA MM (Reasoning)	Accuracy	45.76	69.99	+24.23
VQA-RAD	Accuracy	74.90	70.92	-3.98

Experiment Figures

A representative case study from MedXpertQA MM showing GPT-5 correctly diagnosing esophageal perforation.

Main Takeaways

GPT-5 moves medical AI from 'human-comparable' to 'super-human' on standardized benchmarks, particularly in multimodal reasoning (+24-29% over experts).
Gains are most pronounced in reasoning-intensive tasks (MedXpertQA, USMLE Step 2) rather than pure knowledge recall (MMLU), suggesting improved inference capabilities.
The model successfully integrates visual findings (CT scans, X-rays) with textual history to correct diagnoses that text-only models might miss.
An anomaly exists in VQA-RAD where GPT-5 underperforms GPT-5-mini, possibly due to safety conservatism or calibration issues on smaller datasets.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Multimodal Models (LMMs)
Understanding of Chain-of-Thought (CoT) prompting
Knowledge of standard medical benchmarks (MedQA, USMLE, VQA-RAD)

Key Terms

_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.

CoT: Chain-of-Thought—a prompting technique where the model is asked to generate step-by-step reasoning before providing the final answer

Zero-shot: Evaluating a model without providing any specific training examples for the task in the prompt context

MedQA: A benchmark dataset consisting of multiple-choice questions from medical licensing exams (US, China, Taiwan)

USMLE: United States Medical Licensing Examination—a three-step examination for medical licensure in the U.S.

VQA-RAD: Visual Question Answering in Radiology—a dataset of clinical questions about radiology images

MedXpertQA: A challenging benchmark for expert-level medical knowledge and reasoning, containing both text-only and multimodal subsets

MMLU: Massive Multitask Language Understanding—a broad benchmark covering 57 subjects; this paper uses the Medical subset

Hallucination: In AI, generating plausible-sounding but factually incorrect information

Multimodal: Involving multiple types of data inputs, such as text and images simultaneously