MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

📝 Paper Summary

Medical Multi-modal Large Language Models
Medical AI Agents

MMedAgent is the first general-purpose multi-modal medical agent that integrates diverse specialized tools via an instruction-tuned LLaVA-Med backbone to solve complex medical tasks across multiple imaging modalities.

Core Problem

Existing medical MLLMs are either generalists with limited depth or specialists restricted to narrow tasks, lacking the ability to seamlessly plan and execute multiple complex tasks across diverse imaging modalities.

Why it matters:

Clinical practice requires handling varied data types (MRI, CT, X-ray) and tasks (diagnosis, segmentation, report generation) simultaneously, which single models struggle to do effectively
Current generalist models lack the expert-level precision of specialized tools, while specialized tools cannot handle flexible natural language instructions

Concrete Example: A user asks to 'detect the lesion in this CT scan and then segment it.' A standard VQA model might describe the lesion textually but cannot output a mask. A segmentation model (like MedSAM) needs a specific bounding box input, not a high-level text command. MMedAgent bridges this by calling a detection tool first, then passing coordinates to the segmentation tool.

Key Novelty

Multi-modal Medical Agent (MMedAgent) with Adapted Toolset

Connects a central medical MLLM planner (LLaVA-Med) to six specialized tools using a unified 'Thought-Action-Value' dialogue format
Adapts general vision tools to the medical domain (e.g., fine-tuning Grounding DINO on medical data) to fill gaps where off-the-shelf medical tools were missing
Instruction-tunes the planner on a newly curated dataset that teaches it when to call tools and how to aggregate their results into a final response
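
To make the 'Thought-Action-Value' format concrete, one planner turn could be represented as a small structured record. This is only a sketch: the field names (`thoughts`, `actions`, `value`), the `API_name` / `API_params` keys, and the tool name `grounding` are assumptions chosen for illustration, not a schema confirmed by this summary.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PlannerTurn:
    """One planner turn in a hypothetical Thought-Action-Value layout."""
    thoughts: str                                 # why a tool is (or is not) needed
    actions: list = field(default_factory=list)   # tool calls; empty means "answer directly"
    value: str = ""                               # text shown to the user

    def needs_tool(self) -> bool:
        return len(self.actions) > 0


def parse_turn(raw: str) -> PlannerTurn:
    """Parse a JSON string emitted by the planner into a PlannerTurn."""
    data = json.loads(raw)
    return PlannerTurn(
        thoughts=data.get("thoughts", ""),
        actions=data.get("actions", []),
        value=data.get("value", ""),
    )


# Hypothetical planner output for a localization request.
raw_turn = json.dumps({
    "thoughts": "The user wants the lesion localized, so a grounding tool is needed.",
    "actions": [{"API_name": "grounding", "API_params": {"caption": "lesion"}}],
    "value": "I will call the grounding tool to locate the lesion.",
})
turn = parse_turn(raw_turn)
print(turn.needs_tool(), turn.actions[0]["API_name"])
```

Keeping the tool call in a machine-readable field separate from the user-facing text is what lets a controller dispatch the call without parsing free-form prose.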

Architecture

The inference workflow of MMedAgent. It illustrates how the Planner receives instruction/image, generates 'Thought' and 'Action', executes the 'Tool', and aggregates results into a final 'Answer'.

Evaluation Highlights

Outperforms GPT-4o on average across representative medical tasks, specifically achieving higher scores in Segmentation and Detection metrics
Achieves state-of-the-art performance compared to open-source medical MLLMs (e.g., LLaVA-Med, RadFM) on VQA, report generation, and classification tasks
Demonstrates efficient extensibility: successfully learns to use a new tool (CT-Seg) with only 20 instruction-tuning samples

Breakthrough Assessment

8/10

First comprehensive multi-modal agent specifically for medicine. It successfully bridges the gap between high-level reasoning and low-level specialized medical tools, showing superior performance to both specialized and generalist baselines.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal agentic reasoning in which an LLM planner receives a user instruction and a medical image, selects appropriate external tools, and aggregates their results.

Inputs: User instruction X_q and medical image I_q

Outputs: Final natural language answer X_answer, potentially incorporating tool outputs (images/masks/text)

Pipeline Flow

Planner (LLaVA-Med) analyzes Input (Image + Text)
Planner generates Thought and Action (Tool Call)
Tool Execution (External Model runs on Image)
Planner Aggregation (Combines User Input + Tool Output)
Final Response Generation
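
The five steps above can be sketched as a short control loop. Everything here is a stand-in: `fake_planner` is a rule-based stub in place of the fine-tuned LLaVA-Med planner, and the `TOOLS` registry, its single `segmentation` entry, and the file name are hypothetical.

```python
from typing import Callable, Dict


def fake_planner(instruction: str, tool_output: str = "") -> dict:
    """Stand-in for the fine-tuned planner (hypothetical rule-based stub)."""
    if tool_output:
        # Second pass: aggregate the tool result into the final answer.
        return {"thought": "Tool result received.", "action": None,
                "value": f"Result for '{instruction}': {tool_output}"}
    if "segment" in instruction.lower():
        return {"thought": "Segmentation is requested.", "action": "segmentation",
                "value": "Calling the segmentation tool."}
    return {"thought": "No tool needed.", "action": None,
            "value": "Answering directly."}


# Hypothetical tool registry: each tool maps an image path to a text summary.
TOOLS: Dict[str, Callable[[str], str]] = {
    "segmentation": lambda image: f"mask computed for {image}",
}


def run_agent(instruction: str, image: str) -> str:
    first = fake_planner(instruction)                # steps 1-2: analyze, emit thought/action
    if first["action"] is None:
        return first["value"]                        # no tool call: answer directly
    tool_output = TOOLS[first["action"]](image)      # step 3: tool execution
    final = fake_planner(instruction, tool_output)   # step 4: aggregation
    return final["value"]                            # step 5: final response


print(run_agent("Segment the liver", "ct_001.png"))
```

The planner is called twice in the tool path: once to decide on the action and once to fold the tool output into the answer, which mirrors the planner/aggregator dual role described in this summary.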

System Modules

Action Planner / Aggregator

Parses user intent, decides on tool usage, generates API calls, and synthesizes final response

Model or implementation: LLaVA-Med (fine-tuned)

Grounding Tool (Perception Tools)

Localizes objects in medical images via bounding boxes

Model or implementation: Grounding DINO (fine-tuned on medical datasets)

Segmentation Tool (B-Seg / G-Seg) (Perception Tools)

Segments regions of interest based on bounding boxes

Model or implementation: MedSAM

Report Generation Tool

Generates detailed medical reports for X-rays

Model or implementation: ChatCAD

Retrieval Tool (RAG)

Retrieves medical knowledge from Merck Manual

Model or implementation: ChatCAD+

Classification Tool (Perception Tools)

Classifies images into closed-set categories (modalities, organs)

Model or implementation: BiomedCLIP
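
As a rough illustration of closed-set classification with a vision-language encoder, the sketch below scores an image embedding against one embedding per candidate label and picks the most similar. The three-dimensional vectors are toy values invented for the example, not BiomedCLIP outputs, and the similarity-plus-softmax recipe is a common CLIP-style approach assumed here rather than a detail taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def classify(image_embedding, label_embeddings):
    """Return (best_label, softmax probabilities) over a closed label set."""
    labels = list(label_embeddings)
    sims = [cosine(image_embedding, label_embeddings[name]) for name in labels]
    # Softmax over similarities; subtract the max for numerical stability.
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings (assumed values): one vector per candidate modality.
label_embeddings = {
    "CT": [1.0, 0.0, 0.0],
    "MRI": [0.0, 1.0, 0.0],
    "X-ray": [0.0, 0.0, 1.0],
}
image_embedding = [0.1, 0.2, 0.9]

best, probs = classify(image_embedding, label_embeddings)
print(best)
```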

Novel Architectural Elements

First application of the 'Thought-Action-Value' agentic loop specifically for the medical domain using a specialized LLaVA-Med backbone
Pipeline includes a custom-adapted medical grounding tool (Fine-tuned Grounding DINO) to enable text-prompted segmentation in medical images
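
Text-prompted segmentation can be viewed as a two-stage chain: a grounding step turns the prompt into a bounding box, and a box-prompted segmenter turns the box into a mask. The sketch below shows only that data flow. Both `ground` and `segment` are hypothetical stubs operating on a tiny integer grid; they do not call Grounding DINO or MedSAM.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixel indices


def ground(image: List[List[int]], prompt: str) -> Box:
    """Hypothetical grounding stub: bounding box of all non-zero pixels.

    A real grounding tool would condition on the text prompt; this stub
    ignores it and assumes the image contains at least one non-zero pixel.
    """
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for row in image for c, v in enumerate(row) if v]
    return (min(cols), min(rows), max(cols), max(rows))


def segment(image: List[List[int]], box: Box) -> List[List[int]]:
    """Hypothetical segmentation stub: non-zero pixels inside the box."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= c <= x1 and y0 <= r <= y1 and v) else 0
             for c, v in enumerate(row)]
            for r, row in enumerate(image)]


def grounded_segmentation(image: List[List[int]], prompt: str):
    box = ground(image, prompt)       # stage 1: text prompt -> box
    return box, segment(image, box)   # stage 2: box -> mask


image = [
    [0, 0, 0, 0],
    [0, 5, 7, 0],
    [0, 0, 9, 0],
    [0, 0, 0, 0],
]
box, mask = grounded_segmentation(image, "lesion")
print(box)
```

Because the segmenter only ever sees a box, any grounding model that emits boxes could in principle be swapped in without changing the second stage.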

Modeling

Base Model: LLaVA-Med

Training Method: Visual Instruction Tuning (End-to-End)

Objective Functions:

Purpose: Auto-regressive language modeling on the generated sequence (thoughts, tool calls, answers).

Formally: Standard cross-entropy loss on tokens.
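
Written out, a standard auto-regressive cross-entropy objective takes the form below. X_q and I_q are the instruction and image defined earlier; the remaining symbols (x_t for the t-th target token, T for the target length, theta for the model parameters) are notation introduced here for illustration. Restricting the sum to the generated tokens (thoughts, tool calls, answers) rather than the input tokens is an assumption typical of instruction tuning.

```latex
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_{\theta}\!\left(x_t \,\middle|\, x_{<t},\, X_q,\, I_q\right)
```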

Adaptation: Full fine-tuning of the LLaVA-Med backbone

Trainable Parameters: Not explicitly reported in the paper

Training Data:

Curated instruction-tuning dataset comprising six medical tools solving seven tasks
Data generated by querying GPT-4o with one-shot examples to create 'Thought-Action-Value' dialogues

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 128
epochs: 3
optimizer: AdamW

Compute: 8 NVIDIA A100 GPUs for training
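
For reference, the reported settings can be collected into a small config object. This is a sketch, not the authors' training script: the class name and the `per_gpu_batch` helper are hypothetical, and treating 128 as a global batch size split evenly across the 8 GPUs under plain data parallelism is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 2e-5
    batch_size: int = 128      # assumed to be the global batch size
    epochs: int = 3
    optimizer: str = "AdamW"
    num_gpus: int = 8          # from the compute note (8 NVIDIA A100 GPUs)

    def per_gpu_batch(self, grad_accum_steps: int = 1) -> int:
        """Per-GPU micro-batch under plain data parallelism (an assumption)."""
        denom = self.num_gpus * grad_accum_steps
        if self.batch_size % denom:
            raise ValueError("batch_size must divide evenly across GPUs and accumulation steps")
        return self.batch_size // denom


cfg = TrainConfig()
print(cfg.per_gpu_batch())
```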

Comparison to Prior Work

vs. RadFM: MMedAgent uses specialized tools for tasks like segmentation and detection, whereas RadFM attempts to solve them within a single model architecture.
vs. LLaVA-Med: MMedAgent adds the planning and tool-use layer, allowing it to perform tasks (like segmentation) that LLaVA-Med cannot do natively.
vs. GPT-4o: MMedAgent is open-source and specifically tuned for medical tool integration, offering better grounding and segmentation performance in medical contexts.

Limitations

The performance of the agent is upper-bounded by the performance of the individual tools (e.g., if MedSAM fails, the agent fails).
The context length of the backbone LLM limits the amount of history and tool outputs that can be processed.
Requires explicit instruction tuning for every new tool added, though the CT-Seg experiment suggests that as few as 20 samples can be enough.

Reproducibility

Code: https://github.com/Wangyixinxin/MMedAgent

Code and a web UI are publicly available in the repository above. The paper details the specific datasets used for tool fine-tuning (FLARE2021, WORD, etc.). Pre-trained weights for the agent are not explicitly linked, but code is provided.

📊 Experiments & Results

Evaluation Setup

Comprehensive evaluation across 7 tasks: Grounding, two Segmentation variants (B-Seg and G-Seg), Classification, Medical Report Generation (MRG), RAG, and VQA.

Benchmarks:

RSNA Pneumonia (Grounding/Detection)
COVID-19 CT Segmentation (Segmentation)
MIMIC-CXR (Medical Report Generation)
VQA-RAD (Visual Question Answering)
SLAKE (Visual Question Answering)

Metrics:

IoU (Intersection over Union)
Dice Score
Accuracy
BLEU-4
CIDEr
ROUGE-L

Statistical methodology: Not explicitly reported in the paper
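
The two overlap metrics listed above, IoU and Dice, can be computed as follows. This is a minimal pure-Python sketch: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples and masks flat lists of 0/1 values, and returning 1.0 for two empty masks is a convention chosen here, not one taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dice(pred, target):
    """Dice score of two binary masks given as flat lists of 0/1."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * inter / total if total > 0 else 1.0


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))   # intersection 1, union 7
print(dice([1, 1, 0, 0], [1, 0, 1, 0]))      # 2 * 1 / (2 + 2)
```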

Key Results

Comparative analysis against State-of-the-Art (SOTA) methods across different medical tasks.
Benchmark	Metric	Baseline	This Paper	Δ
RSNA Pneumonia (Grounding)	IoU	0.589	0.781	+0.192
COVID-19 CT Segmentation (G-Seg)	Dice	0.081	0.760	+0.679
VQA-RAD	Open-Ended Accuracy	60.48	73.90	+13.42
MIMIC-CXR (Report Generation)	RadGraph F1	11.1	29.8	+18.7
MIMIC-CXR (Report Generation)	CIDEr	0.059	0.264	+0.205
CT-Seg (New Tool Integration)	Success Rate	0.0	100.0	+100.0

Experiment Figures

Radar chart comparing MMedAgent against baselines (GPT-4o, LLaVA-Med, RadFM) across 5 task axes (VQA, G-Seg, MRG, Grounding, Classification).

Main Takeaways

MMedAgent consistently outperforms generalist models (GPT-4o, LLaVA-Med) on specialized tasks like segmentation and grounding by leveraging expert tools.
The agentic framework enhances the backbone model's inherent VQA capabilities, likely due to better instruction following learned during training.
The system is data-efficient when learning new tools: in the CT-Seg experiment, 20 instruction-tuning samples were enough for it to learn to invoke the new tool.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multi-modal Large Language Models (MLLMs)
Familiarity with visual instruction tuning
Basic knowledge of medical imaging modalities (CT, MRI, X-ray)

Key Terms

VQA: Visual Question Answering—answering natural language questions about an image

LLaVA-Med: A large language and vision assistant specifically trained on biomedical images and text

Grounding DINO: An open-set object detector that can identify objects based on text descriptions

SAM: Segment Anything Model—a foundation model for image segmentation

MedSAM: A version of SAM fine-tuned for medical images

RAG: Retrieval-Augmented Generation—enhancing model responses by retrieving relevant information from external knowledge bases

IoU: Intersection over Union—a metric for measuring the overlap between a predicted bounding box/mask and the ground truth

Dice score: An overlap metric between two sets (often binary segmentation masks), computed as twice the size of their intersection divided by the sum of their sizes

RadFM: A generalist foundation model for radiology

BiomedCLIP: A vision-language foundation model pre-trained on biomedical literature and images

MIMIC-CXR: A large publicly available dataset of chest radiographs with radiology reports

grounding: Identifying and localizing specific objects within an image (often with bounding boxes)

CIDEr: Consensus-based Image Description Evaluation—a metric for evaluating image captioning quality

BLEU: Bilingual Evaluation Understudy—a metric for evaluating machine-translated text against reference text

ROUGE-L: Recall-Oriented Understudy for Gisting Evaluation (Longest Common Subsequence)—metric for evaluating text summarization
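
As an illustration of the ROUGE-L entry, the sketch below computes LCS-based precision, recall, and F-score over whitespace tokens. It is a simplified version: lowercasing plus whitespace splitting stands in for real tokenization, `beta = 1.0` is an assumed default, and the two report sentences are invented examples.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0):
    """Return (precision, recall, F) based on LCS over whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    f = (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
    return p, r, f


# Invented example sentences: the candidate is a subsequence of the reference.
p, r, f = rouge_l(
    "no acute cardiopulmonary process",
    "no evidence of acute cardiopulmonary process",
)
print(p, r, f)
```

Here the LCS has length 4, so precision is 4/4 and recall is 4/6; the score rewards in-order word overlap without requiring the matched words to be contiguous.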