
MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis

Chunzheng Zhu, Yangfang Lin, Shen Chen, Yijun Wang, Jianxin Lin
Not explicitly reported in the paper
arXiv (2025)
MM RL Reasoning Agent

📝 Paper Summary

Medical Vision-Language Models, Visual Chain-of-Thought (CoT)
MedEyes improves medical visual reasoning with a dual-stream reinforcement learning framework that combines autonomous exploration with structured expert eye-tracking trajectories and prevents reasoning collapse.
Core Problem
Pure on-policy reinforcement learning in medical models often suffers from 'advantage collapse,' generating plausible text without looking at relevant image regions, while supervised fine-tuning overfits to fixed paths.
Why it matters:
  • Medical diagnosis requires progressive visual focusing (scanning, then drilling), which standard models fail to replicate, leading to 'cognitive traps' (repetitive, low-quality reasoning)
  • Lack of explicit grounding between reasoning steps and visual evidence triggers information loss and visual hallucinations in complex imaging tasks
  • Naive behavior cloning of expert trajectories mimics actions without capturing underlying reasoning logic, limiting generalization to new cases
Concrete Example: In a pneumothorax case (Fig. 1), an SFT model yields vague responses, while a standard CoT model generates a plausible but incorrect path ignoring the actual lesion. MedEyes actively scans for abnormalities and 'drills' down for analysis, correctly locating the issue.
Key Novelty
Hybrid RL with Dual-Stream Advantage Decoupling
  • Simulates clinician workflows via a Gaze-guided Reasoning Navigator (GRN) that switches between 'scanning' (broad search) and 'drilling' (focused analysis) modes based on confidence
  • Decouples optimization gradients for on-policy exploration and off-policy expert guidance (Dual-stream GRPO) to prevent expert data from overwhelming the model's self-learning capability
  • Uses a Confidence Value Sampler (CVS) with nucleus sampling to generate diverse, high-quality expert trajectories that serve as 'cognitive anchors' during training
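The decoupling idea in the second bullet can be illustrated with a minimal sketch. It assumes a GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation) and computes it separately for the on-policy and expert streams. The function names, the `eps` constant, and the reward values are hypothetical and chosen only for illustration; the paper's exact objective may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dual_stream_advantages(on_policy_rewards, expert_rewards, eps=1e-6):
    """Normalize each stream against its own statistics (illustrative assumption)."""
    return (
        group_relative_advantages(on_policy_rewards, eps),
        group_relative_advantages(expert_rewards, eps),
    )


# Hypothetical rewards: expert trajectories score far higher than rollouts.
on_policy = [0.2, 0.4, 0.1, 0.3]
expert = [0.9, 1.0]

# Pooled normalization: every on-policy rollout falls below the pooled mean.
pooled = group_relative_advantages(on_policy + expert)

# Decoupled normalization: the better rollouts keep a positive learning signal.
on_adv, expert_adv = dual_stream_advantages(on_policy, expert)

print("pooled on-policy advantages:   ", [round(a, 2) for a in pooled[:4]])
print("decoupled on-policy advantages:", [round(a, 2) for a in on_adv])
```

With pooled normalization, the high-reward expert samples push all on-policy advantages below zero, which is one way expert data can overwhelm self-exploration. Normalizing per stream keeps the ranking among the model's own rollouts intact.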
Evaluation Highlights
  • Achieves a +8.5pp average improvement across five medical VQA benchmarks over the best baseline, GMAI-VL
  • Outperforms Qwen2.5-VL-3B by +23.4pp on VQA-RAD (70.7 vs 47.3) and +22.9pp on SLAKE (79.1 vs 56.2)
  • Surpasses recent medical reasoning models like Med-R1 and DeepEyes across all tested datasets (e.g., +14.3pp vs DeepEyes on VQA-RAD)
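The percentage-point gaps quoted for Qwen2.5-VL-3B follow directly from the reported scores. The short check below recomputes them; the dictionary layout and the helper name `pp_gain` are illustrative assumptions, and only the four scores come from the summary above.

```python
# Reported accuracies (MedEyes, Qwen2.5-VL-3B) taken from the bullets above.
scores = {
    "VQA-RAD": (70.7, 47.3),
    "SLAKE": (79.1, 56.2),
}


def pp_gain(model_score, baseline_score):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(model_score - baseline_score, 1)


gains = {name: pp_gain(m, b) for name, (m, b) in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}pp")
```

Note that these are absolute differences in accuracy, not relative improvements.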
Breakthrough Assessment
8/10
Significant performance jumps over strong baselines (GPT-4o, Med-R1) and a methodologically sound approach to the 'advantage collapse' problem in reinforcement learning with verifiable rewards (RLVR), achieved by integrating structured expert priors.