
M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

Fan Bai, Yuxin Du, Tiejun Huang, Max Q.‐H. Meng, Bo Zhao
Peking University, Beijing Academy of Artificial Intelligence, Southern University of Science and Technology
arXiv.org (2024)
MM Benchmark Pretraining QA

📝 Paper Summary

Medical Multi-Modal Large Language Models · 3D Medical Image Analysis
M3D introduces a large-scale 3D medical dataset, a comprehensive benchmark, and a specialized MLLM (M3D-LaMed) that processes 3D volumes directly via a spatial pooling perceiver.
Core Problem
Existing medical MLLMs are designed primarily for 2D images. On 3D modalities such as CT and MRI, they either fail outright or fall back on costly slice-by-slice analysis that loses information.
Why it matters:
  • 3D medical images (CT, MRI) contain critical spatial information essential for diagnosis that is lost in 2D slice-based approaches
  • Large-scale 3D image-text datasets are scarce, in contrast to abundant 2D data, which hinders the development of native 3D medical AI models
  • Current benchmarks do not jointly evaluate diverse 3D tasks such as segmentation, positioning, and report generation
Concrete Example: When analyzing a 3D CT scan for a tumor, a 2D-based model might process individual slices effectively but fail to understand the tumor's volumetric shape or spatial relationship to other organs, whereas M3D-LaMed processes the full 3D context.
Key Novelty
Native 3D Medical MLLM Ecosystem (Dataset + Model + Benchmark)
  • Constructs the largest 3D medical dataset (M3D-Data) by crawling professional sites and using LLMs to generate instruction pairs from diagnostic reports
  • Proposes a 3D Spatial Pooling Perceiver that compresses high-dimensional 3D visual tokens into a manageable sequence for the LLM while preserving spatial structure
  • Integrates a promptable segmentation module (SegVol) allowing the MLLM to output segmentation masks for referring expressions in 3D space
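
The Spatial Pooling Perceiver bullet above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `spatial_pool_tokens`, the grid and pooling sizes, the channel widths, and the single linear projection are all assumptions chosen for illustration. The idea shown is to place the flat token sequence back on its 3D grid, average-pool neighboring tokens, and then project the shorter sequence to the LLM's embedding width.

```python
import numpy as np


def spatial_pool_tokens(tokens, grid, pool, proj_weight):
    """Pool 3D visual tokens on their spatial grid, then project them.

    tokens:      (N, C) array, N = D * H * W tokens from a 3D vision encoder
    grid:        (D, H, W) token grid shape (hypothetical values below)
    pool:        (pd, ph, pw) pooling window along each axis
    proj_weight: (C, C_out) linear projection to the LLM width (assumed)
    """
    d, h, w = grid
    pd, ph, pw = pool
    n, c = tokens.shape
    assert n == d * h * w, "token count must match the grid"
    assert d % pd == 0 and h % ph == 0 and w % pw == 0, "pool must divide grid"

    # Restore the 3D layout so that neighboring tokens are spatial neighbors.
    vol = tokens.reshape(d, h, w, c)

    # Split every axis into (blocks, window) and average inside each window.
    vol = vol.reshape(d // pd, pd, h // ph, ph, w // pw, pw, c)
    pooled = vol.mean(axis=(1, 3, 5))  # shape: (d/pd, h/ph, w/pw, c)

    # Flatten back to a sequence and project to the LLM embedding width.
    seq = pooled.reshape(-1, c)
    return seq @ proj_weight


# Illustrative sizes only: an 8 x 16 x 16 grid of 32-dim tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8 * 16 * 16, 32))
proj = rng.standard_normal((32, 64)) * 0.02
out = spatial_pool_tokens(tokens, grid=(8, 16, 16), pool=(2, 2, 2), proj_weight=proj)
print(tokens.shape[0], "->", out.shape[0], "tokens")
```

With these assumed sizes, 2,048 visual tokens shrink to 256. Because pooling happens on the grid rather than on the flat sequence, each output token still corresponds to a contiguous 3D region of the scan.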
Evaluation Highlights
  • M3D-LaMed outperforms baselines across the 8 tasks of M3D-Bench, notably in 3D image-text retrieval and report generation
  • The automated VQA instruction-generation pipeline behind M3D-Data achieves a 99.4% pass rate in expert review
  • Establishes the first comprehensive benchmark (M3D-Bench) covering retrieval, reporting, VQA, positioning, and segmentation for 3D medical images
Breakthrough Assessment
9/10
Significantly advances the under-explored field of 3D medical MLLMs by providing the missing critical infrastructure: a large-scale dataset, a native 3D model architecture, and a standardized benchmark.