VLM: Vision-Language Model—a model capable of processing both image and text inputs to generate text or embeddings
VisRAG-Ret: The retriever component of VisRAG that encodes queries (text) and documents (images) into a shared embedding space
VisRAG-Gen: The generator component of VisRAG that takes the query and retrieved document images as input to generate an answer
InfoNCE loss: A contrastive loss function used to train the retriever to pull positive query-document pairs closer and push negatives apart
OCR: Optical Character Recognition—technology to convert images of text into machine-encoded text
TextRAG: The traditional RAG pipeline that relies on parsing documents into text segments for retrieval and generation
weighted mean pooling: A pooling strategy that reduces a variable-length sequence of token representations to a single embedding, assigning higher weights to tokens that appear later in the sequence
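MRR@10: Mean Reciprocal Rank at 10, a measure of retrieval quality that averages, over all queries, the reciprocal of the rank of the first relevant document within the top 10 results (a query contributes 0 if no relevant document appears there)
Recall@10: The proportion of a query's relevant documents that appear in the top 10 retrieved results, averaged over all queries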
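
To make the two retrieval metrics above concrete, here is a minimal Python sketch of how MRR@10 and Recall@10 might be computed from ranked result lists. The function names, the document IDs, and the toy data are hypothetical and chosen only for illustration; this is not taken from the VisRAG evaluation code, which may handle ties or empty relevance sets differently.

```python
from typing import Hashable, Sequence, Set, Tuple


def reciprocal_rank_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Return 1/rank of the first relevant document in the top k, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Return the fraction of relevant documents that appear in the top k."""
    if not relevant:
        # Assumption: a query with no relevant documents scores 0.0.
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def evaluate(runs: Sequence[Tuple[Sequence[Hashable], Set[Hashable]]], k: int = 10) -> Tuple[float, float]:
    """Average both metrics over queries; each run is (ranked_doc_ids, relevant_doc_ids)."""
    if not runs:
        return 0.0, 0.0
    mrr = sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)
    recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
    return mrr, recall


# Hypothetical toy data: three queries with their ranked results and relevant sets.
toy_runs = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant document at rank 2
    (["d2", "d5"], {"d2", "d9"}),        # rank 1, but only one of two relevant documents found
    (["d4"], {"d8"}),                    # no relevant document retrieved
]
mrr_at_10, recall_at_10 = evaluate(toy_runs, k=10)
print(f"MRR@10={mrr_at_10:.3f} Recall@10={recall_at_10:.3f}")
```

Note that the two metrics reward different things: MRR@10 depends only on where the first relevant document lands, while Recall@10 depends on how many of the relevant documents are retrieved at all, regardless of their order within the top 10.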