MedRAX: Medical Reasoning Agent for Chest X-ray

📝 Paper Summary

Medical AI Agents, Chest X-ray Interpretation, Multimodal Medical Reasoning

MedRAX is a tool-using AI agent that orchestrates specialized medical vision models through an LLM reasoning engine to solve complex chest X-ray interpretation tasks without additional training.

Core Problem

Existing AI tools for chest X-rays (CXRs) are fragmented, task-specific solutions that operate in isolation, while general-purpose multimodal models suffer from hallucinations and poor multi-step reasoning.

Why it matters:

Radiologists face significant time burdens analyzing CXRs, the most common diagnostic procedure (4.2 billion annually)
End-to-end Large Multimodal Models (LMMs) lack the transparency and reliability required for high-stakes clinical decision-making
Fragmented tools (e.g., separate classifiers and segmenters) hinder widespread clinical adoption due to lack of unified integration

Concrete Example: When asked a complex diagnostic query requiring segmentation, measurement, and classification, a standard LMM might hallucinate findings or fail to cross-reference regions, whereas MedRAX sequentially calls a segmenter, measures the region, and then classifies it.
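
The chain in this example (segment, measure, classify) can be sketched as three plain functions. This is only an illustrative sketch: `segment`, `measure`, and `classify` are hypothetical stand-ins rather than MedRAX tools, the mask is a hard-coded toy array, and the area threshold is a made-up number with no clinical meaning.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask, 1 = pixel inside the segmented region


def segment(image_path: str) -> Mask:
    """Stub segmenter (hypothetical): returns a fixed toy mask, runs no model."""
    return [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 0, 0],
    ]


def measure(mask: Mask) -> Tuple[int, Tuple[int, int, int, int]]:
    """Return the pixel area and bounding box (x_min, y_min, x_max, y_max)."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        return 0, (0, 0, 0, 0)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return len(points), (min(xs), min(ys), max(xs), max(ys))


def classify(area: int, threshold: int = 5) -> str:
    """Stub classifier (hypothetical): made-up area threshold, not a clinical rule."""
    return "flag for review" if area > threshold else "no flag"


# Each step consumes the previous step's output, so the final label is tied
# to a concrete, inspectable region instead of a free-form guess.
area, box = measure(segment("cxr.png"))
label = classify(area)
```

Because every intermediate value (mask, area, box) is explicit, a wrong answer can be traced to the step that produced it, which is the transparency argument made above.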

Key Novelty

MedRAX: A Training-Free Modular Reasoning Agent

Integrates heterogeneous expert models (segmentation, classification, VQA) into a unified ReAct (Reasoning and Acting) loop driven by an LLM
Decouples tool creation from agent instantiation, allowing dynamic selection of specialized tools without retraining the core reasoning engine
Introduces ChestAgentBench, a large-scale benchmark derived from real clinical cases to rigorously evaluate multi-step medical reasoning

Architecture

The MedRAX framework architecture showing the interaction between the LLM agent and various specialized tools.

Evaluation Highlights

Achieves state-of-the-art performance on ChestAgentBench compared to both open-source and proprietary models
Demonstrates substantial improvements in complex reasoning tasks (detailed finding analysis, clinical decision making) over baseline models like GPT-4o alone
Outperforms biomedical specialist models (like LLaVA-Med and CheXagent) on the newly introduced comprehensive benchmark

Breakthrough Assessment

8/10

Significant for integrating disparate medical AI tools into a coherent agentic workflow and for providing a much-needed benchmark for complex reasoning, though it is primarily an integration of existing SOTA tools rather than a new model architecture.

⚙️ Technical Details

Problem Definition

Setting: Automated interpretation of Chest X-ray images through multi-step reasoning and tool execution

Inputs: Chest X-ray image and natural language medical query

Outputs: Textual response answering the query, potentially accompanied by visual artifacts (plots, segmentations)

Pipeline Flow

User Query Processing (Observation)
Reasoning Engine (Thought: determines next action)
Tool Execution (Action: calls specific model)
Response Integration (incorporates tool output into memory)
Final Response Generation
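
The five stages above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `scripted_engine` is a hypothetical stand-in for the LLM that follows a fixed plan, the two tools are stubs, and every returned value is made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool signature: (image path, text argument) -> text observation.
Tool = Callable[[str, str], str]


@dataclass
class Step:
    thought: str
    action: Optional[str]  # tool name, or None when the engine is ready to answer
    action_input: str = ""
    observation: str = ""


def scripted_engine(query: str, memory: List[Step]) -> Step:
    """Stand-in for the LLM: picks the next action from a fixed plan."""
    plan = [
        ("Check for pathologies first.", "classify", "all findings"),
        ("Localize the most likely finding.", "ground", "pleural effusion"),
    ]
    if len(memory) < len(plan):
        thought, action, arg = plan[len(memory)]
        return Step(thought, action, arg)
    return Step("Enough evidence to answer.", None)


def run_agent(
    query: str,
    image: str,
    tools: Dict[str, Tool],
    engine: Callable[[str, List[Step]], Step] = scripted_engine,
    max_steps: int = 5,
) -> Tuple[str, List[Step]]:
    memory: List[Step] = []
    for _ in range(max_steps):
        step = engine(query, memory)  # Thought: decide the next action
        if step.action is None:  # the engine chose to stop calling tools
            break
        # Action: call the selected tool
        step.observation = tools[step.action](image, step.action_input)
        memory.append(step)  # Response integration: store the tool output
    evidence = "; ".join(s.observation for s in memory)
    return f"Answer to '{query}' based on: {evidence}", memory


# Stub tools; the score and box coordinates are made-up values.
TOOLS: Dict[str, Tool] = {
    "classify": lambda image, arg: "classifier: pleural effusion score 0.91",
    "ground": lambda image, arg: f"grounding: {arg} at box (120, 300, 260, 420)",
}

answer, memory = run_agent("Is there an effusion?", "cxr.png", TOOLS)
```

The `max_steps` cap is a design choice in this sketch: it bounds the number of tool calls if the engine never decides to stop.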

System Modules

Reasoning Engine

Orchestrates the workflow using a ReAct loop, maintains memory, and decides which tool to call

Model or implementation: GPT-4o (Vision) [Reference implementation, swappable]

VQA Tool (Perception Tools)

Answers free-form visual questions

Model or implementation: CheXagent (8.5M training samples) or LLaVA-Med (7B VLM)

Segmentation Tool (Perception Tools)

Segments anatomical structures or abnormalities

Model or implementation: MedSAM (biomedical segmentation) or PSPNet (ChestX-Det trained)

Grounding Tool (Perception Tools)

Localizes specific findings with bounding boxes

Model or implementation: Maira-2 (7B VLM)

Disease Classification Tool (Perception Tools)

Detects specific pathologies

Model or implementation: DenseNet-121 (TorchXRayVision)

Report Generation Tool

Generates full radiology reports

Model or implementation: SwinV2 Transformer + BERT decoder

Novel Architectural Elements

Dynamic orchestration of heterogeneous medical expert models (classifiers, segmenters, VLMs) within a unified ReAct agent without retraining
Decoupled tool integration where the LLM learns usage via context/definition rather than weight updates
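
One way to picture the decoupling described above is a registry whose tool descriptions are rendered into the engine's context. The sketch below is an assumption-laden illustration: `ToolSpec`, `ToolRegistry`, and the prompt wording are hypothetical names, not the MedRAX API, and the tool bodies are stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str  # the only thing the reasoning engine sees about the tool
    run: Callable[[str], str]  # hypothetical signature: image path -> text result


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def names(self) -> List[str]:
        return sorted(self._tools)

    def render_prompt(self) -> str:
        """Build the context text from which the engine learns tool usage."""
        lines = ["You can call the following tools:"]
        for name in self.names():
            lines.append(f"- {name}: {self._tools[name].description}")
        return "\n".join(lines)

    def call(self, name: str, image: str) -> str:
        return self._tools[name].run(image)


registry = ToolRegistry()
registry.register(ToolSpec(
    "segmentation",
    "Segments anatomical structures or abnormalities.",
    lambda img: "mask for " + img,
))
registry.register(ToolSpec(
    "classification",
    "Detects specific pathologies.",
    lambda img: "scores for " + img,
))
prompt_before = registry.render_prompt()

# Adding a tool only changes the rendered context; no weights are touched.
registry.register(ToolSpec(
    "report_generation",
    "Generates full radiology reports.",
    lambda img: "report for " + img,
))
prompt_after = registry.render_prompt()
```

In this sketch, integrating a new tool amounts to one `register` call, which mirrors the claim that tools can be added without retraining the reasoning engine.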

Modeling

Base Model: GPT-4o (Reasoning Engine)

Training Method: In-context learning / Tool use (No gradient updates to the reasoning engine)

Compute: Supports flexible deployment (local to cloud). Tools can be quantized and distributed across CPU/GPU.

Comparison to Prior Work

vs. CheXagent/LLaVA-Med: MedRAX is an agentic framework that uses tools rather than a single end-to-end model, enabling better multi-step reasoning.
vs. MDAgents: MedRAX focuses on single-agent tool orchestration to reduce computational overhead compared to multi-agent coordination.
vs. MMedAgent: MedRAX does not require retraining to integrate new tools and specializes in in-depth chest X-ray analysis rather than broad multi-modality coverage.
vs. RaDialog [not cited in paper]: MedRAX integrates discrete tools (segmentation, classification) rather than relying solely on a conversational vision-language model.

Limitations

Dependency on the performance of underlying tools (errors in tools propagate to the agent)
Latency may be higher than end-to-end models due to multiple sequential tool calls
Proprietary reasoning engines (like GPT-4o) raise privacy and cost concerns for clinical deployment

Reproducibility

Code: https://github.com/bowang-lab/MedRAX

Code and data are publicly available at https://github.com/bowang-lab/MedRAX. The framework uses open-source tools (CheXagent, MedSAM, etc.) and proprietary LLMs (GPT-4o) via API. The interface is built with Gradio.

📊 Experiments & Results

Evaluation Setup

Evaluation on ChestAgentBench, a new benchmark for complex CXR reasoning.

Benchmarks:

ChestAgentBench (complex medical VQA, multi-step reasoning) [New]

Metrics:

Accuracy (Percentage of correct answers in six-choice questions)
Statistical methodology: Not explicitly reported in the paper
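
The accuracy metric above can be sketched as follows. This assumes the six options are labeled A through F and that any prediction outside those labels counts as wrong; both are assumptions of this sketch, not details confirmed by the summary.

```python
from typing import Sequence

CHOICES = ("A", "B", "C", "D", "E", "F")  # assumed labels for the six options


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Percentage of questions where the predicted option matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = 0
    for pred, gold in zip(predictions, answers):
        if gold not in CHOICES:
            raise ValueError(f"invalid answer key: {gold!r}")
        # Normalize case and whitespace; anything outside A-F counts as wrong.
        if pred.strip().upper() == gold:
            correct += 1
    return 100.0 * correct / len(answers)


# Toy data, not benchmark results: two of four predictions match the key.
score = accuracy(["A", "b", "F", "?"], ["A", "B", "C", "D"])
```

For reference, uniform random guessing over six options would score about 16.7% in expectation, which is the floor against which reported accuracies can be read.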

Key Results

MedRAX achieves state-of-the-art performance on the ChestAgentBench compared to various baselines.
Benchmark	Metric	Baseline	This Paper	Δ
ChestAgentBench	Accuracy	Not reported in the paper	Not reported in the paper	-
ChestAgentBench	Accuracy	72%	Not reported in the paper	-

Experiment Figures

Overview of the ChestAgentBench dataset construction and statistics.

Main Takeaways

MedRAX demonstrates state-of-the-art performance compared to open-source and proprietary models on the ChestAgentBench.
The agentic approach allows for better handling of complex queries requiring multi-step reasoning (detection -> localization -> diagnosis) compared to end-to-end models.
The framework successfully integrates disparate tools (segmentation, classification, generation) without retraining, validating the modular design.
ChestAgentBench reveals that, while progress has been made, challenges in structured reasoning and tool integration persist.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and Vision-Language Models (VLMs)
Familiarity with ReAct (Reasoning and Acting) agent frameworks
Basic knowledge of medical imaging tasks (segmentation, classification, grounding)

Key Terms

ReAct: Reasoning and Acting—a paradigm where LLMs generate reasoning traces and task-specific actions (tool calls) in an interleaved manner

CXR: Chest X-ray—a projection radiograph of the chest used to diagnose conditions affecting the chest, its contents, and nearby structures

VQA: Visual Question Answering—the task of answering natural language questions based on the visual content of an image

Grounding: The process of linking textual concepts (e.g., 'nodule') to specific regions or bounding boxes in an image

LangGraph: A library for building stateful, multi-actor applications with LLMs, used here to manage the agent's reasoning loop

DICOM: Digital Imaging and Communications in Medicine—the international standard for medical images and related information

LMM: Large Multimodal Model—a model capable of processing and generating multiple modalities (e.g., text and images)

Zero-shot: The ability of a model to perform a task without having explicitly seen examples of that specific task during training