MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis

📝 Paper Summary

Medical Vision-Language Models · Visual Chain-of-Thought (CoT)

MedEyes improves medical visual reasoning by combining autonomous exploration with structured expert eye-tracking trajectories using a dual-stream reinforcement learning framework that prevents reasoning collapse.

Core Problem

Pure on-policy reinforcement learning in medical models often suffers from 'advantage collapse,' generating plausible text without looking at relevant image regions, while supervised fine-tuning overfits to fixed paths.

Why it matters:

Medical diagnosis requires progressive visual focusing (scanning, then drilling), which standard models fail to replicate; they instead fall into 'cognitive traps' (repetitive, low-quality reasoning)
Lack of explicit grounding between reasoning steps and visual evidence triggers information loss and visual hallucinations in complex imaging tasks
Naive behavior cloning of expert trajectories mimics actions without capturing underlying reasoning logic, limiting generalization to new cases

Concrete Example: In a pneumothorax case (Fig. 1), an SFT model yields vague responses, while a standard CoT model generates a plausible but incorrect path ignoring the actual lesion. MedEyes actively scans for abnormalities and 'drills' down for analysis, correctly locating the issue.

Key Novelty

Hybrid RL with Dual-Stream Advantage Decoupling

Simulates clinician workflows via a Gaze-guided Reasoning Navigator (GRN) that switches between 'scanning' (broad search) and 'drilling' (focused analysis) modes based on confidence
Decouples optimization gradients for on-policy exploration and off-policy expert guidance (Dual-stream GRPO) to prevent expert data from overwhelming the model's self-learning capability
Uses a Confidence Value Sampler (CVS) with nucleus sampling to generate diverse, high-quality expert trajectories that serve as 'cognitive anchors' during training

Evaluation Highlights

Achieves +8.5pp average improvement across five medical VQA benchmarks compared to the best baseline GMAI-VL
Outperforms Qwen2.5-VL-3B by +23.4pp on VQA-RAD (70.7 vs 47.3) and +22.9pp on SLAKE (79.1 vs 56.2)
Surpasses recent medical reasoning models like Med-R1 and DeepEyes across all tested datasets (e.g., +14.3pp vs DeepEyes on VQA-RAD)

Breakthrough Assessment

8/10

Significant performance jumps over strong baselines (GPT-4o, Med-R1) and a methodologically sound approach to the 'advantage collapse' problem in RLVR by integrating structured expert priors.

⚙️ Technical Details

Problem Definition

Setting: Medical Visual Question Answering formulated as a Markov Decision Process (MDP)

Inputs: Medical image I and clinical query q

Outputs: Diagnostic trajectory tau (sequence of reasoning steps, gaze actions, and observations) and final answer a

Pipeline Flow

Input (Image + Query)
Policy Generation (Reasoning Step + Action)
Gaze-guided Reasoning Navigator (GRN) / Tool Execution
Observation Feedback (Visual Crop)
Repeat until Termination -> Final Answer
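The loop above can be sketched in a few lines. This is only an illustration under stated assumptions: the names Step, Trajectory, rollout, and toy_policy are hypothetical, the policy is a hand-written stub rather than a VLM, and the "observation" is just the requested box instead of an actual visual crop.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (x0, y0, x1, y1) gaze region


@dataclass
class Step:
    reasoning: str
    action: Optional[Box] = None   # region to inspect next, if any
    answer: Optional[str] = None   # set only on the terminating step


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    observations: List[Box] = field(default_factory=list)
    answer: Optional[str] = None


def rollout(policy: Callable[[str, List[Box]], Step], query: str,
            max_steps: int = 6) -> Trajectory:
    """Alternate policy steps and tool feedback until an answer is produced."""
    traj = Trajectory()
    for _ in range(max_steps):
        step = policy(query, traj.observations)
        traj.steps.append(step)
        if step.answer is not None:      # termination: final answer emitted
            traj.answer = step.answer
            break
        if step.action is not None:      # stand-in for tool execution + crop feedback
            traj.observations.append(step.action)
    return traj


def toy_policy(query: str, observations: List[Box]) -> Step:
    """Stub policy (assumption): scan once, drill once, then answer."""
    if not observations:
        return Step("Scan the whole image for abnormalities.", action=(0, 0, 512, 512))
    if len(observations) == 1:
        return Step("Drill into the suspicious region.", action=(300, 40, 420, 160))
    return Step("Evidence is sufficient.", answer="yes")


trajectory = rollout(toy_policy, "Is there a pneumothorax?")
```

In a real system the policy call would be a VLM generation and the appended observation would be the cropped image region fed back into the context.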

System Modules

Policy Model

Generates the reasoning trajectory including thoughts, tool calls (Gaze), and final answers

Model or implementation: Not explicitly stated in the provided text (likely a VLM backbone like Qwen or LLaVA based on baselines)

Gaze-guided Reasoning Navigator (GRN)

Simulates expert visual search; maintains attention state (regions, confidence, mode) to guide exploration

Model or implementation: Algorithmic state machine interacting with an 'expert model' (for off-policy generation)
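A minimal sketch of such a state machine, assuming the switch is driven by the confidence gain between consecutive observations compared against the threshold delta. The exact transition rule is not given in this summary, so the rule below (enter drilling when the gain reaches delta, fall back to scanning when it drops below delta) and the names AttentionState and update are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (x0, y0, x1, y1) region


@dataclass
class AttentionState:
    regions: List[Box] = field(default_factory=list)
    confidence: float = 0.0
    mode: str = "scanning"  # "scanning" (broad search) or "drilling" (focused analysis)


def update(state: AttentionState, region: Box, new_confidence: float,
           delta: float = 0.1) -> AttentionState:
    """Return the next attention state after inspecting `region`.

    Assumed rule: a confidence gain of at least `delta` while scanning
    triggers drilling; a gain below `delta` while drilling returns to scanning.
    """
    gain = new_confidence - state.confidence
    if state.mode == "scanning" and gain >= delta:
        mode = "drilling"
    elif state.mode == "drilling" and gain < delta:
        mode = "scanning"
    else:
        mode = state.mode
    return AttentionState(state.regions + [region], new_confidence, mode)


s0 = AttentionState()
s1 = update(s0, (0, 0, 512, 512), 0.05)    # small gain: keep scanning
s2 = update(s1, (300, 40, 420, 160), 0.40)  # large gain: start drilling
s3 = update(s2, (320, 60, 400, 140), 0.60)  # still improving: keep drilling
s4 = update(s3, (330, 70, 390, 130), 0.62)  # gain stalls: back to scanning
```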

Confidence Value Sampler (CVS)

Constructs diverse off-policy expert trajectories for training

Model or implementation: Nucleus sampling algorithm
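For readers unfamiliar with nucleus (top-p) sampling, here is a small pure-Python sketch. What exactly CVS samples over is not specified in this summary; applying it to candidate regions scored by confidence, and the region names used below, are assumptions made only for illustration.

```python
import random
from typing import Dict, List


def nucleus(probs: Dict[str, float], p: float) -> List[str]:
    """Smallest set of highest-probability items whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: List[str] = []
    total = 0.0
    for item, prob in ranked:
        kept.append(item)
        total += prob
        if total >= p:
            break
    return kept


def sample_nucleus(probs: Dict[str, float], p: float, rng: random.Random) -> str:
    """Draw one item from the nucleus, weighted by its original probability."""
    kept = nucleus(probs, p)
    weights = [probs[k] for k in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Hypothetical confidence-derived distribution over candidate regions.
region_probs = {"left_apex": 0.5, "right_base": 0.3, "mediastinum": 0.15, "background": 0.05}
kept_regions = nucleus(region_probs, p=0.9)
picked = sample_nucleus(region_probs, p=0.9, rng=random.Random(0))
```

Truncating the low-probability tail keeps sampled trajectories plausible, while sampling within the nucleus (rather than always taking the top item) preserves diversity.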

Novel Architectural Elements

Dual-stream GRPO optimization objective that separates advantage normalization for on-policy and off-policy data
Hybrid trajectory construction combining 'Scanning' (global detection) and 'Drilling' (local analysis) modes based on confidence deltas
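The effect of separating advantage normalization can be illustrated with a toy comparison. This is a sketch of the general idea only, not the paper's exact formulation: the reward values are made up, and normalize, joint_advantages, and dual_stream_advantages are hypothetical helper names.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def normalize(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_advantages(on: List[float], off: List[float]) -> Tuple[List[float], List[float]]:
    """Single-stream baseline: normalize on- and off-policy rewards together."""
    adv = normalize(on + off)
    return adv[:len(on)], adv[len(on):]


def dual_stream_advantages(on: List[float], off: List[float]) -> Tuple[List[float], List[float]]:
    """Decoupled variant: each stream is normalized within its own group."""
    return normalize(on), normalize(off)


# Made-up rewards: a weak current policy next to high-reward expert trajectories.
on_rewards = [0.2, 0.4, 0.3, 0.1]
off_rewards = [0.9, 1.0, 0.8, 0.9]

joint_on, joint_off = joint_advantages(on_rewards, off_rewards)
dual_on, dual_off = dual_stream_advantages(on_rewards, off_rewards)
```

With joint normalization every on-policy sample here receives a negative advantage, because the expert rewards pull the shared mean up. With separate normalization the best on-policy sample is still reinforced relative to its own group, which is one way to read the claim that decoupling keeps expert data from overwhelming self-exploration.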

Modeling

Base Model: Not explicitly stated in the provided text

Training Method: Dual-stream Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward across hybrid trajectory distributions.

Formally: J(theta) = sum[min(rho * A, clip(rho, 1 - epsilon, 1 + epsilon) * A)], where rho is the probability ratio between the new and old policy, A is the advantage, and epsilon is the clip parameter (clip_epsilon).

Purpose: Reward diagnostic accuracy.

Formally: r_acc = 1 if answer correct, else 0.

Purpose: Reward structural correctness.

Formally: r_grammar = 1 if tags <reasoning>/<action> are valid.

Purpose: Reward exploration diversity.

Formally: r_div encourages visiting unique, spatially distinct regions.
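The three reward terms can be sketched as follows. Only the high-level definitions come from the summary; the exact-match comparison, the regular expression for tag validity, and the use of mean pairwise IoU as a diversity measure are assumptions chosen for illustration, and how the terms are weighted or combined is not stated here.

```python
import re
from itertools import combinations
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # hypothetical (x0, y0, x1, y1)

# Assumed format: one or more <reasoning>...</reasoning><action>...</action> pairs.
PAIR = re.compile(r"<reasoning>.*?</reasoning>\s*<action>.*?</action>", re.DOTALL)


def r_acc(pred: str, gold: str) -> float:
    """1 if the answer matches the reference (case-insensitive), else 0."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def r_grammar(text: str) -> float:
    """1 if the text consists only of well-formed reasoning/action pairs."""
    has_pair = PAIR.search(text) is not None
    leftover = PAIR.sub("", text).strip()
    return 1.0 if has_pair and not leftover else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def r_div(regions: List[Box]) -> float:
    """Assumed diversity score: 1 minus the mean pairwise IoU of visited regions."""
    if len(regions) < 2:
        return 0.0
    pairs = list(combinations(regions, 2))
    return 1.0 - sum(iou(a, b) for a, b in pairs) / len(pairs)


good = "<reasoning>scan lungs</reasoning><action>crop(0,0,10,10)</action>"
bad = "<reasoning>scan lungs<action>crop(0,0,10,10)</action>"
```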

Training Data:

Off-policy data: Generated via CVS nucleus sampling on expert model outputs
On-policy data: Sampled from current policy pi_theta

Key Hyperparameters:

clip_epsilon: Standard PPO clip parameter epsilon (value not explicitly listed, usually 0.1-0.2)
delta: Confidence threshold for switching between scanning and drilling modes
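To make the role of clip_epsilon concrete, the snippet below evaluates the standard PPO-style clipped surrogate on a few probability ratios. The default of 0.2 is an assumption (the summary does not report the value used), and clipped_surrogate is a hypothetical helper name.

```python
from typing import List


def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def clipped_surrogate(ratios: List[float], advantages: List[float],
                      clip_epsilon: float = 0.2) -> float:
    """Sum over samples of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for rho, adv in zip(ratios, advantages):
        unclipped = rho * adv
        clipped = clip(rho, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * adv
        total += min(unclipped, clipped)  # pessimistic bound on the improvement
    return total


# Positive advantage: gains from pushing the ratio above 1 + eps are capped.
capped_gain = clipped_surrogate([1.5], [1.0])
# Negative advantage with a large ratio: the unclipped (worse) term is kept.
uncapped_loss = clipped_surrogate([1.5], [-1.0])
# Negative advantage with a small ratio: the clipped (worse) term is kept.
floored_loss = clipped_surrogate([0.5], [-1.0])
```

Taking the minimum means the objective never rewards moving the policy further than the clip range allows, while still penalizing moves that make things worse.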

Compute: Not reported in the paper

Comparison to Prior Work

vs. Med-R1: MedEyes uses structured off-policy expert trajectories to guide exploration, whereas Med-R1 relies on pure on-policy RL which can lead to collapse.
vs. Standard GRPO: MedEyes decouples advantage normalization for on/off-policy streams to prevent reward assimilation.
vs. SFT Baselines (RadFM, etc.): MedEyes employs RL to refine reasoning dynamics rather than just mimicking static text-image pairs.

Limitations

Dependency on the quality of the 'expert model' used to generate off-policy trajectories via GRN
Complexity of the dual-stream optimization adds implementation overhead compared to standard SFT
Computational cost of generating and processing multi-step visual trajectories

Reproducibility

Code: https://github.com/zhcz328/MedEyes

The base model used for the experiments is not explicitly named in the provided text, though the baselines suggest a VLM backbone.

📊 Experiments & Results

Evaluation Setup

Medical Visual Question Answering (VQA) across diverse modalities (Radiology, Pathology, etc.)

Benchmarks:

VQA-RAD (Radiology VQA)
SLAKE (Bilingual Medical VQA)
PathVQA (Pathology VQA)
PMC-VQA (Large-scale Medical VQA)
MMMU (Medical subset) (Multi-discipline Understanding)

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

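MedEyes consistently outperforms both general VLMs and specialized medical models across all 5 benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ (pp)
VQA-RAD	Accuracy (%)	56.4	70.7	+14.3
SLAKE	Accuracy (%)	62.7	79.1	+16.4
PathVQA	Accuracy (%)	56.8	64.8	+8.0
Average	Accuracy (%)	57.4	65.9	+8.5
Average	Accuracy (%)	46.1	65.9	+19.8

The VQA-RAD baseline corresponds to the DeepEyes comparison and the first Average row to the GMAI-VL comparison reported under Evaluation Highlights; the baselines for the remaining rows are not named here.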

Experiment Figures

Comparison of reasoning paths between SFT, standard CoT, and MedEyes on a Pneumothorax case.

Main Takeaways

MedEyes enables initially weak models to achieve state-of-the-art performance, surpassing much larger models like GPT-4o on specific benchmarks (e.g., VQA-RAD).
The dual-stream GRPO effectively mitigates 'advantage collapse' where models typically ignore visual evidence in favor of language priors.
The 'Scanning-Drilling' exploration strategy aligns well with human clinical diagnosis, improving both interpretability and accuracy.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Vision-Language Models (VLMs)
Chain-of-Thought (CoT) reasoning

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of outputs for the same input, removing the need for a separate critic model

RLVR: Reinforcement Learning with Verifiable Rewards—using RL to optimize models based on objective correctness (e.g., correct final answer) rather than human preference labels

Off-policy: Learning from data generated by a different policy (in this case, constructed expert trajectories) rather than the model's current behavior

On-policy: Learning from data generated by the model's current policy during training

Nucleus sampling: A text generation strategy that samples from the smallest set of top tokens whose cumulative probability exceeds a threshold p

SFT: Supervised Fine-Tuning—training a model on a fixed dataset of inputs and target outputs

VQA: Visual Question Answering—the task of answering natural language questions about an image