MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining

📝 Paper Summary

Medical Vision-Language Pretraining, Fundus Imaging Dataset Construction, Multi-modal Knowledge Transfer

MM-Retinal V2 introduces a multi-modal fundus dataset and the KeepFIT V2 method, which transfers expert knowledge from sparse image-text pairs into public categorical datasets via hybrid contrastive and generative injection.

Core Problem

Developing fundus foundation models currently relies on large-scale private image-text data because public data is scarce, typically unimodal (images only), or lacks rich textual descriptions.

Why it matters:

Current state-of-the-art models (e.g., RET-CLIP, VisionUnite) are trained on private clinical data that is not released, hindering open research
Existing public datasets mostly provide only categorical labels (e.g., 'glaucoma') rather than detailed diagnostic captions needed for vision-language alignment
Clinical diagnosis requires multi-modal imaging (CFP, FFA, OCT), but most existing works utilize only Color Fundus Photography (CFP)

Concrete Example: A standard classification model trained on public data might label an image simply as 'Diabetic Retinopathy', whereas an ophthalmologist's report would describe 'scattered microaneurysms and hard exudates in the macula'. Current public models fail to learn these fine-grained visual-linguistic associations.

Key Novelty

KeepFIT V2 (Knowledge-Enhanced Pretraining with Elite Knowledge Spark)

Utilizes a small, high-quality 'elite' dataset (MM-Retinal V2) as a 'knowledge spark' to guide pretraining on larger public datasets that possess only categorical labels
Employs a 'Hybrid Image-Text Knowledge Injection' that combines contrastive learning (for global semantic concepts) and generative learning (for local appearance details) to align features
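To make the hybrid objective concrete, here is a minimal pure-Python sketch of a loss that mixes a contrastive term with a generative term. It is an illustration under stated assumptions, not the paper's implementation: the symmetric InfoNCE form, the `temperature` value, and the mixing weight `alpha` are all hypothetical choices, and the function names are invented for this example.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a square image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; pair (i, i) is the match.
    The temperature value is an assumption for illustration.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    cols = [list(c) for c in zip(*scaled)]  # text-to-image direction
    i2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax(cols[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)

def generative_loss(token_logits, target_ids):
    """Mean token-level cross-entropy for one caption (captioning objective)."""
    nll = [-log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)

def hybrid_loss(sim, token_logits, target_ids, alpha=0.5):
    """Weighted sum of both terms; alpha is a hypothetical mixing weight."""
    return (alpha * contrastive_loss(sim)
            + (1 - alpha) * generative_loss(token_logits, target_ids))

# Toy inputs: a well-aligned 2x2 batch and a 2-token caption over a 4-word vocabulary.
sim = [[1.0, 0.0], [0.0, 1.0]]
token_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
loss = hybrid_loss(sim, token_logits, target_ids=[0, 2])
```

The contrastive term only sees one similarity score per image-text pair, which is why it is associated with global semantics, while the generative term is penalized token by token and therefore pushes the model to account for individual described findings.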

Architecture

The dataset construction pipeline for MM-Retinal V2.

Evaluation Highlights

Constructed MM-Retinal V2 with 17,341 total image-text pairs across CFP, FFA, and OCT modalities
Compiled MM-Retinal-Text, a text-only corpus of 452,000 ophthalmic utterances for domain-specific text encoder pretraining
Achieves performance competitive with models trained on massive private datasets (e.g., 190K+ pairs) while using only ~5K elite pairs per modality

Breakthrough Assessment

8/10

Significant contribution via the release of the first high-quality public multi-modal (CFP, FFA, OCT) image-text dataset, addressing a major scarcity bottleneck in medical VLP.

⚙️ Technical Details

Problem Definition

Setting: Vision-Language Pretraining (VLP) for Fundus Imaging

Inputs: Fundus images (CFP, FFA, OCT) and associated textual reports/captions

Outputs: Pretrained vision and text encoders capable of zero-shot classification and disease diagnosis
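As a rough illustration of how pretrained encoders are typically used for zero-shot classification, the sketch below scores an image embedding against one text embedding per class by cosine similarity and picks the best match. The embeddings are toy vectors invented for this example, and `zero_shot_classify` is a hypothetical helper rather than part of the released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the class whose text embedding is closest to the image embedding."""
    scores = {name: cosine(image_emb, emb) for name, emb in class_text_embs.items()}
    return max(scores, key=scores.get), scores

# Toy stand-ins for text-encoder outputs of class prompts (assumed values).
class_text_embs = {
    "diabetic retinopathy": [0.9, 0.1, 0.0],
    "glaucoma": [0.1, 0.9, 0.0],
    "normal": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]  # toy stand-in for a vision-encoder output

label, scores = zero_shot_classify(image_emb, class_text_embs)
print(label)  # diabetic retinopathy
```

No classifier head is trained here; the class set can be changed at inference time simply by encoding different prompts.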

Pipeline Flow

Data Curation (MM-Retinal V2 construction)
Preliminary Textual Pretraining
Hybrid Image-Text Knowledge Injection

System Modules

Data Construction Pipeline

Extract and clean image-text pairs from diagram books and expert inputs

Model or implementation: Semi-automated pipeline (OCR + Regular Expressions)
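The regular-expression half of such a pipeline might look like the sketch below, which pulls figure captions out of OCR text so they can later be paired with the cropped images. The caption pattern ("Fig. 3-12 ..." or "Figure 4.1: ...") and the cleaning rules are assumptions made for illustration; the source does not specify the actual patterns used.

```python
import re

# Hypothetical caption format: "Fig. 3-12 <text>" or "Figure 4.1: <text>" on one line.
CAPTION_RE = re.compile(
    r"^(?:Figure|Fig\.?)[ \t]*(\d+(?:[-.]\d+)*)[ \t]*[:.]?[ \t]*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)

def clean(text):
    """Drop non-printable OCR debris and collapse runs of whitespace.

    Simplification: this also strips legitimate non-ASCII characters.
    """
    text = re.sub(r"[^\x20-\x7E]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def extract_captions(ocr_text):
    """Return (figure_id, caption) tuples found in the OCR output."""
    return [(m.group(1), clean(m.group(2))) for m in CAPTION_RE.finditer(ocr_text)]

ocr_text = (
    "Chapter 3\n"
    "Fig. 3-12  Scattered  microaneurysms and hard exudates.\n"
    "Body text that is not a caption.\n"
    "Figure 4.1: Optic disc cupping\n"
)
captions = extract_captions(ocr_text)
# [('3-12', 'Scattered microaneurysms and hard exudates.'), ('4.1', 'Optic disc cupping')]
```

In a semi-automated setup, output like this would still be checked by a person, since OCR errors in figure numbers or medical terms are not caught by pattern matching alone.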

Text Encoder

Encode ophthalmic knowledge prior to vision-language alignment

Model or implementation: Not explicitly named in text (likely Transformer-based)

Hybrid Knowledge Injection

Transfer knowledge from elite data to the model

Model or implementation: Hybrid visual feature matching
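One plausible reading of "visual feature matching" is sketched below. It is an assumption for illustration, not the paper's confirmed mechanism: a public image embedding attends over elite image embeddings, and the resulting weights blend the corresponding elite text embeddings into an "expert" text feature. The mean-squared-error refinement term, the `temperature` value, and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def expert_text_feature(public_img, elite_imgs, elite_txts, temperature=0.1):
    """Blend elite text features, weighted by image similarity to the public image."""
    weights = softmax([dot(public_img, e) / temperature for e in elite_imgs])
    dim = len(elite_txts[0])
    blended = [sum(w * t[d] for w, t in zip(weights, elite_txts)) for d in range(dim)]
    return blended, weights

def refinement_loss(public_txt, expert_txt):
    """Assumed MSE between the public sample's text feature and the expert feature."""
    return sum((a - b) ** 2 for a, b in zip(public_txt, expert_txt)) / len(public_txt)

# Toy features: the public image resembles the first elite image.
public_img = [1.0, 0.0]
elite_imgs = [[0.9, 0.1], [0.0, 1.0]]
elite_txts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

blended, weights = expert_text_feature(public_img, elite_imgs, elite_txts)
loss = refinement_loss([1.0, 0.0, 0.0], blended)
```

Under this reading, a public image labeled only with a category borrows descriptive text knowledge from the elite pairs it visually resembles, which is the intuition behind the "knowledge spark".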

Novel Architectural Elements

Hybrid injection mechanism merging contrastive (global) and generative (local) objectives for medical knowledge transfer
Expert knowledge refinement loss (mentioned conceptually in text)

Comparison to Prior Work

vs. RETFound: Uses explicit vision-language pairing rather than just image self-supervision
vs. FLAIR: Uses real expert captions (MM-Retinal V2) instead of template-generated prompts
vs. RET-CLIP/VisionUnite: Achieves competitive performance using accessible small-scale 'elite' data (~17K pairs) and public categorical data, rather than massive private datasets

Limitations

Reliance on diagram books may introduce domain shift compared to raw clinical data
OCT images required manual collection due to scarcity in books
Performance metrics (accuracy/AUC) are claimed to be competitive but specific numbers are not present in the provided text snippet

Reproducibility

Code: https://github.com/lxirich/MM-Retinal

Dataset (MM-Retinal V2) and Model (KeepFIT V2) are publicly available via GitHub. The text subset MM-Retinal-Text is also released. Exact hyperparameters for pretraining are not in the provided text snippet.

📊 Experiments & Results

Evaluation Setup

Zero-shot classification, few-shot classification, and linear probing on fundus downstream tasks

Benchmarks:

MM-Retinal V2 (Dataset Construction) (Data release statistics) [New]

Metrics:

Number of image-text pairs
Vocabulary size
Number of disease categories
Statistical methodology: Not explicitly reported in the paper

Key Results

The paper's primary quantitative contribution available in the text is the construction of the MM-Retinal V2 dataset, significantly expanding public resources.

Benchmark	Metric	Baseline	This Paper	Δ
MM-Retinal V2 (CFP Modality)	Image-Text Pairs	Not reported in the paper	6,720	Not reported in the paper
MM-Retinal V2 (FFA Modality)	Image-Text Pairs	Not reported in the paper	5,119	Not reported in the paper
MM-Retinal V2 (OCT Modality)	Image-Text Pairs	0	5,502	+5,502
MM-Retinal-Text	Text Utterances	Not reported in the paper	452,000	Not reported in the paper
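As a quick consistency check, the per-modality counts in the table can be summed and compared with the 17,341 total reported under Evaluation Highlights. The snippet uses only figures from this summary; the variable names are arbitrary.

```python
# Per-modality image-text pair counts taken from the table above.
pairs = {"CFP": 6720, "FFA": 5119, "OCT": 5502}

total = sum(pairs.values())

# Share of each modality in the full dataset, in percent.
shares = {modality: 100 * count / total for modality, count in pairs.items()}

print(total)  # 17341
```

The three modalities are of comparable size, with CFP the largest at a little under 40% of all pairs.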

Experiment Figures

Statistical analysis of the MM-Retinal V2 dataset: Word frequency, caption length distribution, and vocabulary diversity.

Main Takeaways

MM-Retinal V2 is the first public dataset providing high-quality image-text pairs for three major fundus modalities: CFP, FFA, and OCT.
The captions are linguistically rich: 91.3% of OCT captions contain between 1 and 50 words, and the vocabulary is diverse, unlike that of template-based datasets.
The dataset covers over 96 fundus abnormalities, including rare diseases, enabling broader disease generalization than typical disease-specific datasets.
The KeepFIT V2 method demonstrates that a 'spark' of high-quality data can effectively guide pretraining on larger categorical datasets, offering a viable alternative to massive private data collection.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Pretraining (VLP) concepts (Contrastive Learning, Generative Learning)
Fundus imaging modalities (CFP, FFA, OCT)
Medical image analysis basics

Key Terms

CFP: Color Fundus Photography—standard 2D retinal imaging showing fundus structure

FFA: Fundus Fluorescein Angiography—imaging technique using dye to capture vascular changes in the retina

OCT: Optical Coherence Tomography—imaging technique providing cross-sectional views of retinal layers

KeepFIT V2: The proposed pretraining framework: Knowledge-Enhanced Pretraining for Fundus Image-Text V2

Elite Knowledge Spark: The concept of using a small, high-quality paired dataset to inject expert knowledge into a model trained primarily on coarser public data

Contrastive Learning: A learning method that aligns representations by pulling positive image-text pairs closer and pushing negative pairs apart

Generative Learning: A learning method where the model learns to generate text from images (or vice versa), forcing it to capture local details

VLP: Vision-Language Pretraining—training models on paired images and text to learn joint representations

MM-Retinal-Text: A large text-only dataset of ophthalmic knowledge constructed by the authors for pretraining the text encoder