MedVLThinker: Simple Baselines for Multimodal Medical Reasoning

📝 Paper Summary

Medical Visual Question Answering (Med-VQA)
Large Reasoning Models (LRMs)

MedVLThinker demonstrates that Reinforcement Learning with Verifiable Rewards (RLVR) on filtered text-only medical data significantly outperforms supervised fine-tuning and multimodal training for medical visual reasoning tasks.

Core Problem

Current medical Large Multimodal Models (LMMs) lack open, reproducible recipes for reasoning capabilities, often relying on closed data or failing to integrate 'thinking' paradigms effectively with multimodal inputs.

Why it matters:

The absence of open recipes hinders community research and fair comparison in medical AI
Clinicians need models that can 'think before answering' to handle complex multimodal diagnoses reliably
Existing approaches are either closed-source, limited to specific modalities (e.g., MRI only), or release weights without training code

Concrete Example: When the 7B model is trained via standard Supervised Fine-Tuning (SFT) on text-only reasoning traces, its average accuracy falls below that of the base model (from 53.5% to 43.8%), showing that naive imitation of reasoning chains is less effective than RL-based self-reasoning.

Key Novelty

RLVR-centric Open Recipe for Medical LMMs

Applies Reinforcement Learning with Verifiable Rewards (RLVR) using Group Relative Policy Optimization (GRPO) to medical visual QA, rewarding correct final answers rather than imitating reasoning traces
Implements a rigorous difficulty-based data filtering pipeline ('pass count' filtering) to remove trivial or impossible questions before training
Finds that training on text-only reasoning data with RL provides larger gains for *multimodal* tasks than training on multimodal data itself

Architecture

The complete MedVLThinker pipeline including data curation, difficulty filtering, and the two training paradigms (SFT vs RLVR)

Evaluation Highlights

MedVLThinker-7B (RLVR on text-only data) achieves 54.9% average accuracy across 6 benchmarks, setting a new state-of-the-art for open medical LMMs
RLVR on text-only data improves the 7B base model from 53.5% to 54.9%, while SFT on text-only data degrades it to 43.8%
Scaling the approach to 32B parameters achieves performance on par with the proprietary GPT-4o model

Breakthrough Assessment

8/10

Provides the first fully open recipe for reasoning medical LMMs and reveals the counter-intuitive finding that text-only RLVR beats multimodal training for visual tasks. Empirical results are strong: the 32B model performs on par with GPT-4o.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Medical Question Answering where the model generates a reasoning chain followed by a final answer

Inputs: Medical image (optional) + Text question

Outputs: Reasoning trace (Chain-of-Thought) + Final Answer Choice

Pipeline Flow

Data Curation (Difficulty Filtering)
Training (SFT or RLVR)
Inference (Reasoning Generation)

System Modules

Base Model

Multimodal backbone used for initializing training and performing inference

Model or implementation: Qwen2.5-VL (3B, 7B, 32B)

Novel Architectural Elements

Integration of difficulty-based 'pass count' filtering directly into the training data pipeline for medical reasoning

Modeling

Base Model: Qwen2.5-VL (3B, 7B, 32B)

Training Method: Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO

Objective Functions:

Purpose: Optimize policy to maximize expected reward while staying close to reference model.

Formally: GRPO objective using clipped surrogate loss with KL divergence regularization.

Purpose: Verify answer correctness.

Formally: Reward = +1 if (format is correct AND answer is correct), else -1.
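
The reward rule above can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the `<think>`/`<answer>` tag format, the single-letter answer choice, and the name `verifiable_reward` are assumptions made here for the example.

```python
import re

# Assumed (hypothetical) output format: a reasoning trace inside <think> tags,
# followed by a single-letter choice inside <answer> tags.
_FORMAT = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>\s*([A-Za-z])\s*</answer>\s*$",
    re.DOTALL,
)


def verifiable_reward(response: str, gold_choice: str) -> float:
    """Return +1.0 only if the format is valid AND the answer matches, else -1.0."""
    match = _FORMAT.match(response)
    if match is None:
        # Format violation: penalized regardless of the content.
        return -1.0
    predicted = match.group(1).upper()
    return 1.0 if predicted == gold_choice.strip().upper() else -1.0
```

Because the reward depends only on the format and the final choice, the content of the reasoning trace is never directly supervised.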

Training Data:

Text-only: m23k dataset (filtered from 23k to 16,512 questions via pass count)
Multimodal: PMC-VQA dataset (filtered from 177k to 115,456 questions via pass count)
Pass count filtering: Keep questions with 0 < correct_trials < 7 (out of 16)
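
The pass count rule can be illustrated with a short filter. This is a sketch under stated assumptions: the per-question counts are taken as already computed from 16 sampled answers, and the function name and data layout are hypothetical rather than taken from the released code.

```python
from typing import Dict, List


def filter_by_pass_count(
    questions: List[Dict],
    pass_counts: List[int],
    low: int = 0,
    high: int = 7,
) -> List[Dict]:
    """Keep questions whose pass count c satisfies low < c < high.

    pass_counts[i] is the number of correct answers out of 16 trials
    for questions[i]; the bounds follow the rule stated above.
    """
    if len(questions) != len(pass_counts):
        raise ValueError("questions and pass_counts must have the same length")
    return [q for q, c in zip(questions, pass_counts) if low < c < high]


# Hypothetical toy data: ids only, counts out of 16 trials.
toy_questions = [{"id": i} for i in range(5)]
toy_counts = [0, 1, 6, 7, 16]
kept = filter_by_pass_count(toy_questions, toy_counts)
print([q["id"] for q in kept])  # [1, 2]
```

Questions never answered correctly (count 0) and questions answered correctly 7 or more times are both dropped, which matches the stated goal of removing impossible and trivial items.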

Key Hyperparameters:

learning_rate_rl: 1e-6
learning_rate_sft: 1e-4
batch_size_rl_text: 128
batch_size_rl_image: 256
group_size_n: 8 (samples per question)
epochs_rl_text: 5
epochs_rl_image: 1
epochs_sft: 3
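
As a rough worked example, the hyperparameters above can be combined with the filtered dataset sizes to estimate the number of RL update steps. The sketch assumes that the batch size counts questions (not sampled responses), that the group size of 8 applies to both RL settings, and that any partial final batch is dropped; none of these details is confirmed by the summary.

```python
def rl_schedule(num_questions: int, batch_size: int, epochs: int, group_size: int) -> dict:
    """Estimate update steps and rollouts per step for one RL run."""
    steps_per_epoch, leftover = divmod(num_questions, batch_size)
    return {
        "steps_per_epoch": steps_per_epoch,
        "leftover_questions": leftover,
        "total_steps": steps_per_epoch * epochs,
        # Each question is sampled group_size times per step (assumption).
        "rollouts_per_step": batch_size * group_size,
    }


text_run = rl_schedule(num_questions=16_512, batch_size=128, epochs=5, group_size=8)
image_run = rl_schedule(num_questions=115_456, batch_size=256, epochs=1, group_size=8)
print(text_run)   # 129 steps per epoch, 645 total steps, 1024 rollouts per step
print(image_run)  # 451 steps per epoch, 451 total steps, 2048 rollouts per step
```

Under these assumptions both filtered dataset sizes divide evenly by their batch sizes, so no questions are left over in either run.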

Compute: Trained on 8x H100 GPUs (for 3B/7B) or 32 GPUs (for 32B)

Comparison to Prior Work

vs. HuatuoGPT-o1: Uses GRPO (no critic network) and targets multimodal tasks, not just text
vs. MedVLM-R1: Scales to >100k samples and uses comprehensive difficulty filtering
vs. LLaVA-Med: Incorporates explicit 'thinking' (reasoning chains) via RL rather than just instruction tuning

Limitations

RLVR requires questions with objectively verifiable answers (e.g., multiple choice), limiting applicability to open-ended medical inquiries
Text-only training proved more effective than multimodal training, suggesting current multimodal RL strategies or data may be suboptimal
Reliance on teacher models (DeepSeek, GPT-4o) for SFT data generation and difficulty filtering creates dependency on stronger models

Reproducibility

Code: https://github.com/UCSC-VLAA/MedVLThinker

Publicly available: code, models, and curated data at https://github.com/UCSC-VLAA/MedVLThinker. No closed-source dependencies for inference (uses open Qwen2.5-VL base).

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on multiple choice medical visual QA tasks

Benchmarks:

PMC-VQA (Test) (General biomedical visual QA)
MMMU-Health (Val) (Multidisciplinary medical reasoning)
MedXpert-MM (Complex multimodal medical reasoning)
PathVQA (Pathology visual QA)
SLAKE (Radiology/Clinical visual QA)
VQA-Rad (Radiology visual QA)

Metrics:

Accuracy
Statistical methodology: Runs each evaluation 3 times, reports average. Standard deviation < 0.1.

Key Results

Comparison of different training paradigms on the 7B model shows RLVR on text-only data yields the highest performance, while SFT degrades it.

Benchmark	Metric	Baseline	This Paper	Δ
Average (6 benchmarks), RLVR on text-only data	Accuracy	53.5	54.9	+1.4
Average (6 benchmarks), SFT on text-only data	Accuracy	53.5	43.8	-9.7

Main Takeaways

RLVR consistently outperforms SFT across model scales (3B, 7B), confirming that self-correction via reward is superior to imitation for reasoning
Text-only training surprisingly outperforms image-text training for improving multimodal reasoning capabilities
Model scale is critical: 7B models consistently outperform 3B models, and the 32B model reaches parity with proprietary SOTA (GPT-4o)
Combining text and image data (sequential RL or SFT+RL) does not yield gains over text-only RLVR, suggesting interference or data quality issues in current multimodal sets

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Large Multimodal Models (LMMs)
Chain-of-Thought (CoT) reasoning

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—training models to reason by rewarding correct final answers rather than supervising the reasoning steps themselves

GRPO: Group Relative Policy Optimization—an efficient RL algorithm that normalizes rewards within a group of sampled outputs to update the policy without a separate value network

SFT: Supervised Fine-Tuning—training a model to mimic a dataset of inputs and target outputs

CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer

pass count: A difficulty metric defined as the number of times a model answers a question correctly out of N trials (used here for data filtering)

LMM: Large Multimodal Model—an AI model capable of processing and reasoning over multiple data modalities like text and images

nucleus sampling: A text generation method that samples from the top probability mass (top-p) of the vocabulary distribution
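
To make the GRPO definition above concrete, the group-relative normalization can be sketched as follows. This is a minimal illustration of the general technique, not the paper's training code: the use of the population standard deviation and the small `eps` constant are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same question.

    Each advantage is (reward - group mean) / (group std + eps), so no
    separate value network is needed to provide a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One question, 8 sampled responses, rewards of +1 (correct) or -1 (otherwise).
mixed = group_relative_advantages([1, -1, -1, -1, 1, -1, -1, -1])
uniform = group_relative_advantages([-1] * 8)
print([round(a, 2) for a in mixed])
print(uniform)
```

When every response in a group receives the same reward, all advantages are zero and the question contributes no learning signal. This is one plausible reason why filtering out always-solved and never-solved questions by pass count helps, although the summary does not state this rationale explicitly.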