
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning

Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, H. Li, Chen Chen, Ouyang Cheng, Daniel Rueckert
International Conference on Medical Image Computing and Computer-Assisted Intervention (2025)
MM RL Reasoning QA

📝 Paper Summary

Medical Vision-Language Models (Med-VLM) · Radiology Report Generation / VQA · Reinforcement Learning for Reasoning
MedVLM-R1 uses reinforcement learning with rule-based rewards to teach a small medical vision-language model to generate explicit reasoning steps before answering, without requiring expensive reasoning annotations.
Core Problem
Existing medical VLMs trained via supervised fine-tuning (SFT) often overfit to training distributions, struggle with out-of-distribution generalization, and fail to provide the transparent step-by-step reasoning required for clinical trust.
Why it matters:
  • Clinicians and patients need to understand *why* a diagnosis was reached, not just the final classification, for trust and regulatory approval
  • SFT relies on expensive, hard-to-scale expert reasoning data (Chain-of-Thought), limiting the ability to train robust models
  • Models trained only on final answers via SFT are prone to shortcut learning and degrade significantly under domain shift (e.g., MRI to CT)
Concrete Example: In complex queries, standard models might guess the correct answer via pattern matching without understanding. MedVLM-R1, however, generates a '<think>' trace (e.g., analyzing pulmonary nodules) before the '<answer>', though failure cases show it sometimes uses process-of-elimination rather than pure medical deduction.
Key Novelty
MedVLM-R1 (Medical VLM via Reinforcement Learning)
  • Replaces Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to incentivize reasoning using only final-answer labels
  • Uses a composite reward function (Format Reward + Accuracy Reward) to force the model to self-generate a '<think>' block followed by an '<answer>' block
  • Achieves 'emergent reasoning' where the model learns to explain its logic to maximize rewards, without ever seeing ground-truth reasoning traces during training
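The composite reward described in the bullets above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the regular expression, the binary reward values, and the equal weighting of the two terms are all assumptions, and the answer is assumed to be a multiple-choice letter.

```python
import re

# Assumed output layout: one <think> block followed by one <answer> block.
# The exact template used in the paper is not given in this summary.
_TEMPLATE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.fullmatch(completion) else 0.0


def accuracy_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the text inside <answer> matches the gold label, else 0.0."""
    match = _TEMPLATE.fullmatch(completion)
    if match is None:
        return 0.0  # no parsable answer, so no accuracy credit
    predicted = match.group(2).strip().upper()
    return 1.0 if predicted == gold.strip().upper() else 0.0


def total_reward(completion: str, gold: str) -> float:
    """Sum of the two rule-based terms (equal weighting is an assumption)."""
    return format_reward(completion) + accuracy_reward(completion, gold)
```

Only the final-answer label (`gold`) is needed to score a completion. The content of the '<think>' block is never compared against a reference, which is why no reasoning annotations are required.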
Architecture
Figure 1: The training framework of MedVLM-R1 using GRPO.
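The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, so no learned value model is needed. A minimal sketch of that group-relative advantage is shown below. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions; the clipped policy-ratio objective and KL penalty that complete GRPO are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group.

    `rewards` holds the scalar rewards of all completions sampled for one
    prompt. `eps` (an assumed stabilizer) avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four completions for one question, scored by a rule-based reward.
advantages = group_relative_advantages([2.0, 1.0, 0.0, 1.0])
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. When all completions in a group score the same, every advantage is zero and that prompt contributes no gradient signal.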
Evaluation Highlights
  • Raises accuracy across MRI, CT, and X-ray benchmarks from 55.11% (base model) to 78.22%
  • +16% improvement on CT and +35% on X-ray (out-of-distribution) compared to SFT counterparts trained on the same MRI data
  • Outperforms the significantly larger Qwen2-VL-72B and domain-specific HuatuoGPT-Vision-7B despite being a 2B parameter model trained on only 600 samples
Breakthrough Assessment
8/10
Significant efficiency (2B model, 600 samples) and generalization capabilities using RL for reasoning in the medical domain. Demonstrates that reasoning can emerge without expensive CoT supervision.