
EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

Danli Shi, Weiyi Zhang, Jianchen Yang, Siyu Huang, Xiaolan Chen, Mayinuer Yusufu, Kai Jin, Sha Lin, Shunming Liu, Qing Zhang, M. He
The Hong Kong Polytechnic University, Swiss Federal Institute of Technology Lausanne, Clemson University, Centre for Eye Research Australia
arXiv.org (2024)
MM Pretraining Benchmark QA

📝 Paper Summary

Medical Multi-modal Foundation Models · Ophthalmic Disease Diagnosis · Visual-Language Pretraining
EyeCLIP is a foundation model pretrained on 2.77 million ophthalmic images across 11 modalities that aligns multi-modal visual data with clinical text to enable zero-shot diagnosis and cross-modal retrieval.
Core Problem
Existing ophthalmic foundation models typically focus on single modalities (like only fundus photos) or lack alignment between visual data and clinical text, limiting their ability to handle real-world multi-examination scenarios and long-tail diseases.
Why it matters:
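  • Real-world clinical diagnosis relies on multiple aligned examinations, such as color fundus photography (CFP), optical coherence tomography (OCT), and fundus fluorescein angiography (FFA), which current models treat in isolation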
  • Long-tail and rare eye diseases lack sufficient labeled data for standard supervised learning, requiring strong zero-shot or few-shot capabilities
  • Systemic disease prediction (e.g., stroke, myocardial infarction (MI)) from eye images is hindered by scarce positive samples in general populations
Concrete Example: A patient may undergo both Color Fundus Photography (CFP) and Optical Coherence Tomography (OCT). A model trained only on CFP cannot utilize the OCT data for diagnosis. EyeCLIP aligns these diverse modalities so a single encoder can process and relate them to text descriptions.
Key Novelty
Multi-modal Visual-Language Alignment with Shared Representation
  • Combines masked image reconstruction (MAE) for self-supervised learning with contrastive learning (CLIP) to align images with text
  • Uniquely adds an image-image contrastive loss to align different imaging modalities (e.g., CFP and OCT) from the same patient, learning a consistent patient representation across examinations
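The three objectives named above (masked reconstruction, image-text contrast, image-image contrast) can be pictured as one weighted sum. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the symmetric InfoNCE form, the temperature of 0.07, and the equal loss weights are all illustrative choices that this summary does not specify. Embeddings are plain Python lists so the example runs with the standard library alone.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `anchors` is paired with row i of `positives`.

    The temperature value is an assumption, not taken from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(a[i], p[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(mat):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(mat):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(mat)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def masked_mse(pred, target, mask):
    """Mean squared error over masked positions only (mask value 1 = masked)."""
    num = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return num / max(sum(mask), 1)


def combined_loss(img_a, img_b, txt, pred, target, mask,
                  w_rec=1.0, w_it=1.0, w_ii=1.0):
    """Weighted sum of the three objectives; the weights are hypothetical.

    img_a / img_b: embeddings of two modalities (e.g., CFP and OCT) per patient.
    txt: text embeddings paired row-by-row with img_a.
    pred / target / mask: flattened pixel values for the reconstruction term.
    """
    return (w_rec * masked_mse(pred, target, mask)
            + w_it * info_nce(img_a, txt)
            + w_ii * info_nce(img_a, img_b))
```

In this sketch the image-image term reuses the same contrastive function as the image-text term; only the choice of positive pair changes (a second modality from the same patient instead of a text description).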
Evaluation Highlights
  • Achieves state-of-the-art zero-shot classification on 9 ocular datasets, with AUROCs up to 0.757 for Diabetic Retinopathy (vs. 0.654 for BioMedCLIP)
  • Outperforms RETFound and BioMedCLIP in few-shot systemic disease prediction (stroke, MI, dementia) using only 1-16 training samples
  • Demonstrates effective zero-shot cross-modal retrieval, achieving 50.9% Recall@Mean on Retina Image Bank image-to-text retrieval (vs. 45.3% for BioMedCLIP)
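As a rough illustration of how the zero-shot results above are typically obtained with a CLIP-style model, the sketch below classifies an image by picking the class whose text embedding is closest in cosine similarity, and scores retrieval with Recall@K. The function names, the toy embeddings, and the use of a single K are assumptions for illustration; how the reported Recall@Mean aggregates over K is not stated in this summary, and the real embeddings would come from the model's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_emb, class_text_embs):
    """Return the label whose text embedding is most similar to the image.

    class_text_embs maps a label to the embedding of a text prompt for it
    (the prompt wording is up to the user and is not specified here).
    """
    return max(class_text_embs,
               key=lambda label: cosine(image_emb, class_text_embs[label]))


def recall_at_k(image_embs, text_embs, k):
    """Fraction of images whose paired text (same index) ranks in the top k."""
    hits = 0
    for i, img in enumerate(image_embs):
        ranked = sorted(range(len(text_embs)),
                        key=lambda j: cosine(img, text_embs[j]),
                        reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)


# Toy usage with made-up 2-D embeddings.
prompts = {"diabetic retinopathy": [1.0, 0.0], "normal": [0.0, 1.0]}
label = zero_shot_predict([0.9, 0.1], prompts)
```

No labeled training images are involved in either function: the class set is defined entirely by the text prompts, which is what makes the approach usable for rare diseases with few examples.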
Breakthrough Assessment
8/10
Significant advancement in medical foundation models by successfully aligning 11 different ophthalmic modalities with text. Strong zero-shot performance on rare diseases and systemic prediction validates the multi-modal alignment approach.