Med-LVLMs: Medical Large Vision-Language Models—AI systems adapted for medical tasks using both image and text inputs.
DPO: Direct Preference Optimization—an alignment method that optimizes a policy to favor preferred responses over dispreferred ones without a separate reward model.
BiomedCLIP: A vision-language foundation model pre-trained on biomedical image-text pairs, used here for domain identification and retrieval.
Gap statistic: A method typically used in clustering to estimate the optimal number of clusters; here adapted to find the optimal number of retrieved documents.
Cross-modality alignment: Ensuring the model respects and utilizes the visual input (medical image) rather than relying solely on the textual query or retrieved context.
RAG-PT: RAG-based Preference Tuning—the authors' proposed method of fine-tuning the generator using DPO on specific RAG-related failure cases.
Contrastive learning: A training technique that pulls representations of similar pairs (e.g., image and matching text) together and pushes dissimilar pairs apart.
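Hallucination: The generation of text that is factually incorrect, nonsensical, or unsupported by the input (for Med-LVLMs, the medical image), a common failure mode in large language models (LLMs) and vision-language models (VLMs).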
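
As an illustration of the DPO entry above, the following is a minimal sketch of the standard per-example DPO loss, not the authors' implementation. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example. Each input is the summed log-probability of a full response under either the policy being trained or a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))).

    All names and the default beta are illustrative assumptions.
    """
    # How much more (in log-probability) the policy likes each response
    # than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # Numerically stable -log(sigmoid(logit)), i.e. softplus(-logit).
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))
```

The loss equals log 2 when the policy and reference agree, and falls below that as the policy assigns relatively more probability to the preferred response than the reference does. No separate reward model appears anywhere in the computation, which is the point of the method.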