Improving Medical Multi-modal Contrastive Learning with Expert Annotations

📝 Paper Summary

Medical Multi-modal Learning, Contrastive Learning, Data Augmentation with Expert Knowledge

eCLIP enhances medical CLIP models by incorporating radiologist eye-gaze heatmaps via a mixup strategy and curriculum learning to improve embedding alignment without altering the core architecture.

Core Problem

Medical image-text contrastive learning suffers from data scarcity and a 'modality gap,' where embeddings from different modalities (X-rays vs. reports) reside in distinct, non-overlapping regions.

Why it matters:

Standard CLIP models trained on internet data fail to capture nuanced medical abnormalities, grouping different pathologies too closely in embedding space
Acquiring large-scale, high-quality medical datasets is difficult due to privacy concerns and the need for expert annotation
The 'cone effect' restricts embeddings to narrow regions, hampering zero-shot classification and cross-modal retrieval performance

Concrete Example: In standard CLIP, embeddings for different chest X-ray abnormalities (e.g., cardiomegaly vs. atelectasis) often have cosine similarities near 1, making them indistinguishable. An X-ray of a collapsed lung might be retrieved as similar to a heart enlargement case due to poor spatial segregation.

Key Novelty

Expert-annotated CLIP (eCLIP)

Integrates scarce radiologist eye-gaze heatmaps as 'expert attention' signals to create high-quality, semantically rich positive pairs for contrastive learning
Uses a mixup strategy to blend original images with heatmap-augmented versions, effectively multiplying the small expert dataset to improve training density
Employs a curriculum learning schedule (cold start → warm up → cool down) to gradually introduce expert knowledge without destabilizing the base model's training

Architecture

The complete eCLIP training pipeline, illustrating how expert eye-gaze heatmaps are integrated via a Heatmap Processor and Mixup Augmentation.

Evaluation Highlights

Consistent improvement in Zero-shot Classification across 5 medical datasets (e.g., +4.6 accuracy points on RSNA Pneumonia, 47.7 to 52.3, compared to base CLIP)
Enhanced retrieval performance: +4.6 points in R@1 (Recall at Rank 1) for image-to-text retrieval on the CheXpert dataset compared to base CLIP
Superior embedding quality: Reduces 'modality gap' distance by ~48% compared to base CLIP, creating more uniform and aligned representations

Breakthrough Assessment

7/10

A solid methodological contribution that addresses data scarcity by combining radiologist eye-tracking data with mixup. The evaluation is specific to medical imaging, but the approach to integrating expert attention may generalize to other domains with scarce expert annotations.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal contrastive representation learning for medical images and text

Inputs: Chest X-ray images I and corresponding radiology reports T, plus a small subset of radiologist eye-gaze heatmaps E

Outputs: Aligned image embeddings v and text embeddings t in a shared d-dimensional space

Pipeline Flow

Heatmap Processor (fuses image + heatmap)
Mixup Augmentation (blends original and expert images)
Image Encoder (CLIP-based)
Text Encoder (CLIP-based)
Contrastive Loss Calculation

System Modules

Heatmap Processor (Input Processing)

Augment images with expert attention. Uses MHA where heatmap-overlaid images are queries and original images are keys/values.

Model or implementation: Multi-Headed Attention (MHA) block
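
The cross-attention pattern described above (heatmap-overlaid tokens as queries, original-image tokens as keys and values) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes identity projections in place of the learned query/key/value/output weights a real MHA block would have, and the token values, the `gaze` weights, and the function name `cross_attention` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values, num_heads=2):
    """Multi-head scaled dot-product attention with identity projections.

    queries: tokens from the heatmap-overlaid image, shape [n_q][d]
    keys, values: tokens from the original image, shape [n_kv][d]
    Assumption: no learned weights; each head just sees a slice of the features.
    """
    d = len(queries[0])
    assert d % num_heads == 0, "feature size must split evenly into heads"
    head_dim = d // num_heads
    output = []
    for q in queries:
        out_token = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            # Similarity of this query slice to every key slice.
            scores = [
                sum(qi * ki for qi, ki in zip(q[lo:hi], k[lo:hi])) / math.sqrt(head_dim)
                for k in keys
            ]
            weights = softmax(scores)
            # Weighted sum of the value slices for this head.
            for j in range(lo, hi):
                out_token.append(sum(w * v[j] for w, v in zip(weights, values)))
        output.append(out_token)
    return output

# Toy example: three patch tokens and one gaze weight per patch (hypothetical values).
orig_tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8], [0.3, 0.3, 0.9, 0.1]]
gaze = [1.0, 0.2, 0.0]
overlaid = [[g * x for x in tok] for tok, g in zip(orig_tokens, gaze)]
fused = cross_attention(overlaid, orig_tokens, orig_tokens, num_heads=2)
```

Because the values come from the original image, every output token is a convex combination of original tokens; the heatmap only changes which original tokens receive weight.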

Mixup Augmentation (Input Processing)

Generate synthetic training samples by blending original and expert images to handle data scarcity.

Model or implementation: Linear interpolation: lambda * I + (1-lambda) * I_E
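
A minimal sketch of the interpolation above, using nested lists as stand-in images. It assumes lambda is drawn from a symmetric Beta(alpha, alpha) distribution with the listed mixup_alpha of 0.3; the summary only says the value is "for Beta distribution", so the symmetric form and the helper name `mixup_images` are assumptions.

```python
import random

def mixup_images(original, expert, alpha=0.3, rng=random):
    """Blend two same-shaped images pixel by pixel: lam * I + (1 - lam) * I_E.

    Assumption: lam ~ Beta(alpha, alpha), as in standard mixup.
    """
    lam = rng.betavariate(alpha, alpha)
    mixed = [
        [lam * o + (1.0 - lam) * e for o, e in zip(row_o, row_e)]
        for row_o, row_e in zip(original, expert)
    ]
    return mixed, lam

# Toy 2x2 "images" (hypothetical pixel values).
original = [[0.0, 0.2], [0.4, 0.6]]
expert = [[1.0, 1.0], [0.0, 0.0]]
mixed, lam = mixup_images(original, expert, rng=random.Random(0))
```

With alpha below 1 the Beta distribution is U-shaped, so most draws land close to one of the two inputs rather than an even blend.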

Image Encoder (Encoding)

Convert mixed images into d-dimensional embeddings.

Model or implementation: Swin Tiny / ViT-Small / ViT-Base (depending on experiment)

Text Encoder (Encoding)

Convert radiology reports into d-dimensional embeddings.

Model or implementation: Transformer-based Text Encoder (CLIP architecture)

Novel Architectural Elements

Integration of Heatmap Processor with MHA specifically for processing eye-gaze data
Curriculum Learning pipeline for injecting expert annotations: Cold Start -> Warmup -> Cooldown phases

Modeling

Base Model: Swin Tiny, ViT-Small, ViT-Base (Image Encoders initialized from various pretrained weights)

Training Method: Contrastive Learning with Mixup and Curriculum Learning

Objective Functions:

Purpose: Maximize similarity between positive image-text pairs and minimize it for negative pairs.

Formally: InfoNCE Loss L_i = -log( exp(sim(v_i, t_i)/tau) / sum_k exp(sim(v_i, t_k)/tau) ), where the sum runs over all text embeddings t_k in the batch

Purpose: Ensure Heatmap Processor acts as identity function when no expert info is present (during priming).

Formally: L_priming = MSE(HeatmapProcessor(I, E=1), I)
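
The InfoNCE objective listed above can be sketched in plain Python for the image-to-text direction. The temperature value 0.07 is an assumption (the summary does not report tau), `sim` is taken to be cosine similarity, and the function names are hypothetical. CLIP-style training usually averages this with the text-to-image direction, which the sketch omits.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(image_embs, text_embs, tau=0.07):
    """Mean image-to-text InfoNCE loss; pair i is (image_embs[i], text_embs[i])."""
    losses = []
    for i, v in enumerate(image_embs):
        logits = [cosine(v, t) / tau for t in text_embs]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch: perfectly aligned pairs versus deliberately mismatched pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce(images, [[0.0, 1.0], [1.0, 0.0]])
```

The aligned batch yields a loss near zero, while the mismatched batch yields a loss near 1/tau, which is the behavior the objective is meant to reward and penalize.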

Adaptation: Pretraining on MIMIC-CXR or Finetuning existing CLIP

Training Data:

MIMIC-CXR (~200K image-text pairs)
EGD-CXR (1080 eye-gaze heatmaps used for augmentation)
Validation/Test: CheXpert, RSNA Pneumonia, NIH CXR, Open-I

Key Hyperparameters:

mixup_alpha: 0.3 (for Beta distribution)
curriculum_cold_start: 10% of iterations
curriculum_warmup: 30% of iterations
curriculum_cooldown: 40% of iterations
expert_probability_warmup: 0.05 to 0.5
expert_probability_cooldown: 0.1
priming_loss_weight: 0.1
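
One way to read the curriculum hyperparameters above is as a schedule for the probability of using an expert-augmented sample at each training step. The sketch below is an interpretation, not the paper's code. It assumes no expert samples during cold start, a linear ramp from 0.05 to 0.5 during warmup, and a constant 0.1 afterwards. The listed phase fractions cover 80% of iterations, so holding the cooldown value for the remainder is also an assumption.

```python
def expert_probability(step, total_steps,
                       cold_start=0.10, warmup=0.30,
                       warmup_range=(0.05, 0.5), cooldown_prob=0.1):
    """Probability of sampling an expert-augmented image at a given step.

    Assumptions: zero probability in cold start, linear ramp in warmup,
    constant cooldown_prob for the cooldown phase and anything after it.
    """
    progress = step / total_steps
    if progress < cold_start:
        return 0.0
    if progress < cold_start + warmup:
        frac = (progress - cold_start) / warmup
        low, high = warmup_range
        return low + frac * (high - low)
    return cooldown_prob

# Probability at a few points of a hypothetical 100-step run.
schedule = {step: expert_probability(step, 100) for step in (5, 25, 60, 95)}
```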

Compute: Not reported in the paper

Comparison to Prior Work

vs. GLoRIA: eCLIP uses explicit expert eye-gaze heatmaps via mixup, whereas GLoRIA relies on attention aggregation from the model itself
vs. DACL: eCLIP creates positive pairs using expert-guided mixup, while DACL uses domain-agnostic mixup
vs. standard CLIP: eCLIP introduces expert-annotated positive pairs to densify the embedding space and reduce the modality gap

Limitations

Relies on the availability of eye-gaze data, which is extremely scarce (only 1080 heatmaps used)
Performance gains might be constrained by the small size of the expert annotation set
Requires a curriculum learning schedule which adds hyperparameters to tune
Limited to 2D medical imaging (Chest X-rays) in current evaluation

Reproducibility

Code: https://github.com/Yogesh-Kumar-M/eCLIP

Code is publicly available. The datasets used (MIMIC-CXR, EGD-CXR, CheXpert, etc.) are public, but MIMIC-CXR requires credentialed access. The 1080 eye-gaze heatmaps come from EGD-CXR.

📊 Experiments & Results

Evaluation Setup

Zero-shot classification, Linear probing, and Cross-modal retrieval on Chest X-ray datasets

Benchmarks:

CheXpert 5x200 (Zero-shot Classification / Retrieval)
MIMIC 5x200 (Zero-shot Classification / Retrieval)
RSNA Pneumonia (Zero-shot Classification)
CXR 14x100 (Zero-shot Classification) [New]
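
Zero-shot classification on benchmarks like these generally works by embedding one text prompt per class and picking the class whose prompt is most similar to the image embedding. The sketch below illustrates that idea with toy vectors. The class names, embedding values, and the function name `zero_shot_predict` are hypothetical, and real embeddings would come from the image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(image_emb, class_text_embs):
    """Return the label whose prompt embedding is most similar to the image."""
    return max(class_text_embs,
               key=lambda label: cosine(image_emb, class_text_embs[label]))

# Toy prompt embeddings for two classes (hypothetical values).
class_text_embs = {
    "pneumonia": [0.9, 0.1, 0.0],
    "no finding": [0.1, 0.9, 0.0],
}
prediction = zero_shot_predict([0.8, 0.3, 0.1], class_text_embs)
```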

Metrics:

Accuracy (Zero-shot)
AUC (Linear Probe)
Recall@K (Retrieval)
Modality Gap (Euclidean Distance)
Statistical methodology: Not explicitly reported in the paper
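
Recall@K from the metrics list can be computed from a query-by-candidate similarity matrix. This sketch assumes exactly one correct candidate per query, located at the same index. The 5x200 benchmarks may instead count any candidate of the same class as correct, which this simplified version does not model.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching candidate is among the top-k scores.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(similarity)

# Toy 3x3 similarity matrix (hypothetical scores).
sim = [
    [0.9, 0.1, 0.0],  # query 0: correct candidate ranked first
    [0.2, 0.3, 0.8],  # query 1: correct candidate ranked second
    [0.1, 0.2, 0.7],  # query 2: correct candidate ranked first
]
r_at_1 = recall_at_k(sim, k=1)
r_at_2 = recall_at_k(sim, k=2)
```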

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Zero-shot classification results demonstrate eCLIP's superiority over standard CLIP and other augmentation strategies across multiple datasets.
RSNA Pneumonia	Accuracy	47.7	52.3	+4.6
CheXpert 5x200	Accuracy	44.7	48.2	+3.5
MIMIC 5x200	Accuracy	45.1	48.5	+3.4
Retrieval experiments show improved alignment between image and text modalities.
CheXpert 5x200	R@1	24.5	29.1	+4.6
CheXpert 5x200	R@1	25.7	29.6	+3.9
Embedding quality analysis reveals reduced modality gap, indicating better fusion of modalities.
MIMIC-CXR (Test)	Modality Gap (Distance)	0.79	0.41	-0.38
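
The modality gap row can be checked with a short calculation. The sketch assumes the gap is measured as the Euclidean distance between the centroid of the image embeddings and the centroid of the text embeddings, which is a common definition but is not spelled out in this summary. The final lines recompute the relative reduction from the two reported values.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def modality_gap(image_embs, text_embs):
    """Euclidean distance between the image centroid and the text centroid."""
    return math.dist(centroid(image_embs), centroid(text_embs))

# Toy embeddings (hypothetical values) sitting in two separate clusters.
toy_gap = modality_gap([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]])

# Relative reduction implied by the reported gaps (0.79 for CLIP, 0.41 for eCLIP).
baseline_gap, eclip_gap = 0.79, 0.41
relative_reduction = (baseline_gap - eclip_gap) / baseline_gap
```

The relative reduction comes out at roughly 0.48, consistent with the ~48% figure quoted in the evaluation highlights.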

Experiment Figures

Demonstration of the 'Modality Gap' and poor class separation in standard CLIP models trained on medical data.

Comparison of positive/negative pair creation strategies: CLIP vs. M2-Mixup vs. eCLIP.

Main Takeaways

eCLIP consistently outperforms standard CLIP and other mixup strategies (DACL, m3-mixup) in zero-shot classification and retrieval tasks.
The method is sample-efficient: using only 50% of the training data with eCLIP matches the performance of standard CLIP trained on 100% data.
Integrating expert heatmaps significantly reduces the 'modality gap', leading to more uniform and aligned embeddings.
eCLIP improves the generation of radiology reports when used with a frozen LLM (RAG setting), suggesting better retrieval of relevant clinical context.

📚 Prerequisite Knowledge

Prerequisites

Contrastive Learning (CLIP, InfoNCE loss)
Data Augmentation (Mixup)
Attention mechanisms (Multi-headed Attention)
Curriculum Learning

Key Terms


CLIP: Contrastive Language-Image Pre-training—a model trained to align image and text representations by maximizing similarity of correct pairs

Heatmap Processor: A module using Multi-Headed Attention to fuse original images with radiologist eye-gaze heatmaps

Mixup: A data augmentation technique that creates new training samples by taking a convex combination of two existing samples (here, original image and heatmap-augmented image)

Curriculum Learning: A training strategy where the difficulty or nature of training examples is gradually changed over time (e.g., introducing expert annotations slowly)

Modality Gap: The geometric distance between clusters of image embeddings and text embeddings in the shared vector space

InfoNCE: Information Noise Contrastive Estimation—a loss function used to learn representations by pulling positive pairs together and pushing negative pairs apart

R@1: Recall at Rank 1—the percentage of times the correct item is found as the top result in a retrieval task

Zero-shot Inference: Using a pre-trained model to classify samples into categories it hasn't explicitly seen during training, usually via text prompts

Linear Probing: Training a simple linear classifier on top of frozen pre-trained features to evaluate representation quality

MHA: Multi-Headed Attention—a mechanism allowing the model to jointly attend to information from different representation subspaces

UMAP: Uniform Manifold Approximation and Projection—a dimensionality reduction technique used to visualize high-dimensional data in 2D or 3D

Swin Transformer: A hierarchical Vision Transformer whose representation is computed with shifted windows