CLIP: Contrastive Language-Image Pre-training—a model trained to align image and text representations by maximizing similarity of correct pairs
Heatmap Processor: A module using Multi-Headed Attention to fuse original images with radiologist eye-gaze heatmaps
Mixup: A data augmentation technique that creates new training samples by taking a convex combination of two existing samples (here, original image and heatmap-augmented image)
Curriculum Learning: A training strategy in which the difficulty or composition of training examples changes gradually over the course of training (e.g., introducing expert annotations slowly)
Modality Gap: The geometric distance between clusters of image embeddings and text embeddings in the shared vector space
InfoNCE: Information Noise Contrastive Estimation—a loss function used to learn representations by pulling positive pairs together and pushing negative pairs apart
R@1: Recall at Rank 1 (the percentage of retrieval queries for which the correct item is ranked first)
Zero-shot Inference: Using a pre-trained model to classify samples into categories it hasn't explicitly seen during training, usually via text prompts
Linear Probing: Training a simple linear classifier on top of frozen pre-trained features to evaluate representation quality
MHA: Multi-Headed Attention—a mechanism allowing the model to jointly attend to information from different representation subspaces
UMAP: Uniform Manifold Approximation and Projection—a dimensionality reduction technique used to visualize high-dimensional data in 2D or 3D
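Swin Transformer: A hierarchical Vision Transformer that computes self-attention within local windows, shifting the window partition between consecutive layers so that information flows across window boundaries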
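
To make three of the definitions above concrete (Mixup, InfoNCE, and R@1), here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the summarized paper's implementation: the function names are hypothetical, embeddings are plain lists of floats, only the image-to-text direction of the loss is shown (CLIP-style training typically averages both directions), and the default temperature of 0.07 is an assumed value.

```python
import math

def mixup(x_a, x_b, lam):
    """Convex combination lam * x_a + (1 - lam) * x_b, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """sim[i][j] = cosine similarity between image i and text j."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def info_nce(sim, temperature=0.07):
    """Image-to-text InfoNCE: mean cross-entropy over rows of sim,
    treating the diagonal entry (the matching pair) as the correct class."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(sim)

def recall_at_1(sim):
    """Fraction of image queries whose top-ranked text is the matching one."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)
```

With perfectly aligned pairs the similarity matrix is the identity, so recall_at_1 returns 1.0 and info_nce is at its lowest for that batch; swapping the text order drives R@1 to 0.0 and raises the loss.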