EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

📝 Paper Summary

Medical Multi-modal Foundation Models, Ophthalmic Disease Diagnosis, Visual-Language Pretraining

EyeCLIP is a foundation model pretrained on 2.77 million ophthalmic images across 11 modalities that aligns multi-modal visual data with clinical text to enable zero-shot diagnosis and cross-modal retrieval.

Core Problem

Existing ophthalmic foundation models typically focus on single modalities (like only fundus photos) or lack alignment between visual data and clinical text, limiting their ability to handle real-world multi-examination scenarios and long-tail diseases.

Why it matters:

Real-world clinical diagnosis relies on multiple aligned examinations (CFP, OCT, FFA) which current models treat in isolation
Long-tail and rare eye diseases lack sufficient labeled data for standard supervised learning, requiring strong zero-shot or few-shot capabilities
Systemic disease prediction from eye images (e.g., stroke or myocardial infarction, MI) is hindered by scarce positive samples in general populations

Concrete Example: A patient may undergo both Color Fundus Photography (CFP) and Optical Coherence Tomography (OCT). A model trained only on CFP cannot utilize the OCT data for diagnosis. EyeCLIP aligns these diverse modalities so a single encoder can process and relate them to text descriptions.

Key Novelty

Multi-modal Visual-Language Alignment with Shared Representation

Combines masked image reconstruction (MAE) for self-supervised learning with contrastive learning (CLIP) to align images with text
Uniquely adds an image-image contrastive loss to align different imaging modalities (e.g., CFP and OCT) from the same patient, learning a consistent patient representation across examinations

Evaluation Highlights

Achieves state-of-the-art zero-shot classification on 9 ocular datasets, with AUROCs up to 0.757 for Diabetic Retinopathy (vs. 0.654 for BioMedCLIP)
Outperforms RETFound and BioMedCLIP in few-shot systemic disease prediction (stroke, MI, dementia) using only 1-16 training samples
Demonstrates effective zero-shot cross-modal retrieval, achieving 50.9% mean recall on Retina Image Bank image-to-text retrieval (vs. 45.3% for BioMedCLIP)

Breakthrough Assessment

8/10

Significant advancement in medical foundation models by successfully aligning 11 different ophthalmic modalities with text. Strong zero-shot and few-shot results, including on rare diseases and systemic disease prediction, support the multi-modal alignment approach.

⚙️ Technical Details

Problem Definition

Setting: Pretraining a unified visual encoder on multi-modal images and text to learn aligned representations for downstream transfer tasks

Inputs: Multi-modal ophthalmic images (x_i) and corresponding clinical text reports (t_i)

Outputs: Aligned visual and textual embeddings f(x) and g(t) used for classification, retrieval, or VQA

Pipeline Flow

Input Processing (Images + Text Reports)
Hierarchical Keyword Extraction (Text Cleaning)
Joint Pretraining (Contrastive + Reconstruction Losses)
Downstream Adaptation (Zero-shot / Finetuning)

System Modules

Image Encoder (Joint Pretraining)

Extract visual features from any of the 11 ophthalmic modalities

Model or implementation: Vision Transformer (ViT-Large based on CLIP architecture)

Text Encoder (Joint Pretraining)

Extract semantic features from hierarchical clinical keywords

Model or implementation: Transformer-based text encoder (CLIP architecture)

Image Decoder (Joint Pretraining)

Reconstruct masked images for self-supervised learning

Model or implementation: MAE decoder
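
The masking step that feeds the MAE decoder can be sketched as below. This is only an illustration, not the authors' code: it assumes 14x14-pixel patches (taken from the ViT-L/14 encoder named under Base Model) on 224x224 inputs, and a mask ratio of 0.75, which is the default in the original MAE work and is not reported in this summary. The function name `random_patch_mask` is hypothetical.

```python
import random

def random_patch_mask(image_size=224, patch_size=14, mask_ratio=0.75, seed=0):
    """Split the patch grid into visible and masked patch indices.

    mask_ratio=0.75 is an assumption (the original MAE default), not a
    value reported for EyeCLIP in this summary.
    """
    patches_per_side = image_size // patch_size   # 224 // 14 = 16
    num_patches = patches_per_side ** 2           # 16 * 16 = 256
    num_masked = int(num_patches * mask_ratio)

    order = list(range(num_patches))
    random.Random(seed).shuffle(order)            # reproducible shuffle

    masked = sorted(order[:num_masked])           # patches the decoder must reconstruct
    visible = sorted(order[num_masked:])          # patches the encoder actually sees
    return visible, masked

visible, masked = random_patch_mask()
print(len(visible), len(masked))  # 64 192
```

Under these assumptions the encoder processes only 64 of the 256 patches, and the decoder is trained to reconstruct the remaining 192.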

Novel Architectural Elements

Integration of Image-Image Contrastive Loss alongside Image-Text Contrastive and Reconstruction losses to explicitly align different imaging modalities from the same patient
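
One way to picture the positive pairs used by the image-image loss is to group images by patient and pair those that come from different modalities. The sketch below is a guess at the idea rather than the paper's pipeline: the `(patient_id, modality, image_id)` record schema, the file names, and the rule of keeping only cross-modality pairs are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cross_modality_pairs(records):
    """Return (image_a, image_b) pairs from one patient with different modalities.

    `records` is a list of (patient_id, modality, image_id) tuples; this
    schema is hypothetical and only illustrates the pairing idea.
    """
    by_patient = defaultdict(list)
    for patient_id, modality, image_id in records:
        by_patient[patient_id].append((modality, image_id))

    pairs = []
    for images in by_patient.values():
        for (mod_a, img_a), (mod_b, img_b) in combinations(images, 2):
            if mod_a != mod_b:  # keep cross-modality pairs only (assumption)
                pairs.append((img_a, img_b))
    return pairs

records = [
    ("p1", "CFP", "p1_cfp.png"),
    ("p1", "OCT", "p1_oct.png"),
    ("p1", "CFP", "p1_cfp_2.png"),
    ("p2", "FFA", "p2_ffa.png"),
]
print(cross_modality_pairs(records))
# [('p1_cfp.png', 'p1_oct.png'), ('p1_oct.png', 'p1_cfp_2.png')]
```

Patient p2 contributes no pair because only one examination is available, which is why such a loss can only be applied to the subset of patients with multiple modalities.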

Modeling

Base Model: CLIP (ViT-L/14 vision encoder) extended with MAE decoder

Training Method: Multi-objective self-supervised pretraining

Objective Functions:

Purpose: Align images with text descriptions.

Formally: L_img-text = Contrastive loss maximizing similarity between paired image-text embeddings

Purpose: Align different image modalities from the same patient.

Formally: L_img-img = Contrastive loss maximizing similarity between paired image-image embeddings

Purpose: Learn robust visual features from unlabeled data.

Formally: L_recon = MSE loss between original and reconstructed masked images
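
The three objectives above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the contrastive terms are written as a symmetric InfoNCE loss with temperature 0.07 (the form and initial value used by CLIP; the summary does not give EyeCLIP's exact formulation), the reconstruction error is averaged over masked positions only (as in MAE), and the three terms are combined with the loss weights listed under Key Hyperparameters. All function names are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchor i to target i among all targets."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)

def symmetric_contrastive(xs, ys, temperature=0.07):
    """Average of the x-to-y and y-to-x InfoNCE terms (CLIP-style assumption)."""
    return 0.5 * (info_nce(xs, ys, temperature) + info_nce(ys, xs, temperature))

def masked_mse(original, reconstructed, masked_idx):
    """Mean squared error over masked positions only (assumption, as in MAE)."""
    return sum((original[i] - reconstructed[i]) ** 2 for i in masked_idx) / len(masked_idx)

def total_loss(l_img_text, l_img_img, l_recon, w_it=0.75, w_ii=0.75, w_rec=1.0):
    """Weighted sum using the loss weights listed in the summary."""
    return w_it * l_img_text + w_ii * l_img_img + w_rec * l_recon

# Toy 2-d embeddings: row i of each list belongs to the same patient/report.
img = [[1.0, 0.0], [0.0, 1.0]]
txt = [[0.9, 0.1], [0.2, 0.8]]
img_other_modality = [[0.8, 0.2], [0.1, 0.9]]

l_it = symmetric_contrastive(img, txt)
l_ii = symmetric_contrastive(img, img_other_modality)
l_rec = masked_mse([0.5, 0.2, 0.9], [0.4, 0.2, 0.7], masked_idx=[0, 2])
print(total_loss(l_it, l_ii, l_rec))
```

Both contrastive terms shrink as matched rows become more similar than mismatched ones, so the same encoder is pushed to agree with the text and with the patient's other examinations at once.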

Adaptation: Linear probing or full finetuning for downstream tasks; Zero-shot inference via text prompts
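
Zero-shot inference via text prompts can be illustrated as follows: embed one prompt per class, compare each with the image embedding, and pick the most similar class. The sketch uses toy 3-d vectors in place of the real encoder outputs f(x) and g(t); the prompt wording, the temperature of 0.07, and the function names are assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_probs(image_emb, prompt_embs, temperature=0.07):
    """Softmax over cosine similarities between one image and each class prompt."""
    logits = {name: cosine(image_emb, emb) / temperature
              for name, emb in prompt_embs.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(l - m) for name, l in logits.items()}
    z = sum(exps.values())
    return {name: e / z for name, e in exps.items()}

# Toy embeddings standing in for text-encoder outputs of two class prompts.
prompt_embs = {
    "diabetic retinopathy": [0.9, 0.1, 0.0],
    "normal fundus": [0.1, 0.9, 0.1],
}
image_emb = [0.8, 0.2, 0.1]  # stand-in for the image-encoder output

probs = zero_shot_probs(image_emb, prompt_embs)
print(max(probs, key=probs.get))  # diabetic retinopathy
```

No classifier weights are trained here; the class set is defined entirely by the prompts, which is what lets the same model be applied to labels it never saw during pretraining.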

Training Data:

2,777,593 multi-modal images from 128,554 patients across 227 hospitals
11,180 text reports processed into hierarchical keywords
Excludes low-quality images based on vascular structure visibility

Key Hyperparameters:

learning_rate: 0.001 (base, cosine decay)
batch_size: 200
warmup_epochs: 2
loss_weights: lambda_img-text=0.75, lambda_img-img=0.75, lambda_recon=1.0
image_resolution: 224x224
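
The learning-rate entry (base 0.001, cosine decay, 2 warmup epochs) can be read as the schedule sketched below. Several details are assumptions because the summary does not state them: the warmup is taken to be linear, the decay is taken to end at zero, the schedule is stepped per epoch, and `total_epochs` is a hypothetical parameter.

```python
import math

def learning_rate(epoch, total_epochs, base_lr=1e-3, warmup_epochs=2):
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# total_epochs=10 is an arbitrary value chosen only for this printout.
for epoch in range(10):
    print(epoch, round(learning_rate(epoch, total_epochs=10), 6))
```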

Compute: Trained on one NVIDIA Tesla V100 (32GB) GPU for approximately four weeks

Comparison to Prior Work

vs. RETFound: EyeCLIP uses a single shared encoder for 11 modalities (vs. separate weights) and integrates text supervision (vs. image-only MAE)
vs. BioMedCLIP: EyeCLIP includes image-image alignment for multi-view consistency and is domain-specific to ophthalmology with massive private clinical data
vs. FLAIR [not cited in paper]: EyeCLIP aligns diverse modalities via patient-identity matching rather than just text matching

Limitations

Relies heavily on private clinical training data that has not been released, which limits independent reproduction
Text reports required cleaning and keyword extraction, indicating sensitivity to raw text noise
VQA capabilities require aligning with an LLM (Llama-2) via an additional finetuning step, not native to the base model

Reproducibility

Code: https://github.com/Michi-3000/EyeCLIP

The pretraining dataset (private clinical data from China) is not released. The downstream validation datasets are public. The paper does not explicitly confirm whether model weights are available, although a code repository is provided.

📊 Experiments & Results

Evaluation Setup

Zero-shot, Few-shot, and Full-data supervised evaluation across 14 datasets

Benchmarks:

IDRiD, APTOS2019, MESSIDOR2 (Diabetic Retinopathy Classification)
Retina Image Bank (Rare Disease Classification / Multi-modal Retrieval)
UK Biobank (Systemic Disease Prediction: Stroke, MI, etc.)
OphthalVQA (Visual Question Answering)

Metrics:

AUROC (Area Under ROC)
AUPR (Area Under Precision-Recall)
Recall@K (Retrieval)
F1 Score / BLEU (VQA)
Statistical methodology: Two-sided t-tests comparing EyeCLIP with baselines; 95% Confidence Intervals reported
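
Two of the metrics above are simple enough to sketch directly. The code below is a reference-style illustration, not the paper's evaluation script: AUROC is computed through its pairwise-ranking interpretation, and Recall@K is averaged over queries. The function names and toy data are hypothetical.

```python
def recall_at_k(ranked_ids, correct_id, k):
    """1.0 if the correct item is among the top-k retrieved ids, else 0.0."""
    return 1.0 if correct_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(queries, k):
    """Average Recall@K over (ranked_ids, correct_id) query pairs."""
    return sum(recall_at_k(r, c, k) for r, c in queries) / len(queries)

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 0.75

queries = [
    (["r3", "r1", "r2"], "r1"),  # correct report ranked second
    (["r5", "r4", "r6"], "r5"),  # correct report ranked first
]
print(mean_recall_at_k(queries, k=1))  # 0.5
print(mean_recall_at_k(queries, k=2))  # 1.0
```

The pairwise form of AUROC is quadratic in the number of samples, so it is meant for understanding the metric rather than for large evaluation sets.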

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Zero-shot performance demonstrates EyeCLIP's superior generalization ability without task-specific training.
MESSIDOR2 (DR)	AUROC	0.654	0.757	+0.103
OCTID (OCT)	AUROC	0.589	0.800	+0.211
Full-data supervised finetuning shows EyeCLIP achieves SOTA on systemic disease prediction and rare diseases.
UK Biobank (Stroke)	AUROC	0.620	0.641	+0.021
Retina Image Bank (Rare Diseases)	AUROC	0.523	0.561	+0.038
Cross-modal retrieval tasks validate the quality of the learned joint embedding space.
Retina Image Bank	Mean Recall (%)	45.3	50.9	+5.6

Experiment Figures

Radar plots and bar charts comparing Zero-shot performance.

Main Takeaways

Consistent superiority in Zero-shot settings: EyeCLIP beats BioMedCLIP and RETFound across most datasets without seeing training labels, indicating effective semantic alignment.
Data Efficiency: Outperforms baselines significantly in 1-shot to 16-shot scenarios, making it highly valuable for rare diseases where data is scarce.
Multi-modal robustness: A single shared encoder handles 11 modalities effectively, matching or beating RETFound (which uses modality-specific weights) even on RETFound's native modalities (CFP/OCT).
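
The 1-shot to 16-shot setting mentioned above amounts to training on a small, class-balanced subset. A minimal sketch of drawing such a subset is shown below; the paper's actual sampling procedure is not described in this summary, so the function `sample_k_shot`, the fixed seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def sample_k_shot(labels, k, seed=0):
    """Pick up to k example indices per class from a list of labels."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for label in sorted(by_class):
        idxs = by_class[label]
        chosen.extend(rng.sample(idxs, min(k, len(idxs))))
    return sorted(chosen)

# Toy labels: 5 positives and 20 controls, mimicking scarce positive samples.
labels = ["stroke"] * 5 + ["control"] * 20
subset = sample_k_shot(labels, k=4)
print(len(subset))  # 8
```

Capping at `min(k, len(idxs))` keeps the sampler usable when a class has fewer than k examples, which is the typical situation for rare diseases.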

📚 Prerequisite Knowledge

Prerequisites

Contrastive Language-Image Pre-training (CLIP)
Masked Autoencoders (MAE)
Self-supervised learning
Ophthalmic imaging modalities (CFP, OCT, FFA, etc.)

Key Terms

CLIP: Contrastive Language-Image Pre-training—a model that learns to associate images and text by maximizing similarity between correct pairs

MAE: Masked Autoencoder—a vision model that learns by reconstructing missing parts of an image

CFP: Color Fundus Photography—a common 2D imaging technique for the retina

OCT: Optical Coherence Tomography—a non-invasive imaging test that uses light waves to take cross-section pictures of the retina

FFA: Fundus Fluorescein Angiography—a diagnostic procedure using dye to examine blood circulation in the retina

AUROC: Area Under the Receiver Operating Characteristic curve—a performance metric for classification problems at various threshold settings

Recall@K: A retrieval metric measuring if the correct item appears in the top K returned results

Zero-shot: Testing a model on a task it was not explicitly trained for, often using class names as text prompts

Few-shot: Training a model with very few labeled examples per class (e.g., 1 to 16)

VQA: Visual Question Answering—a task where the model answers natural language questions about an image