KB-VQA: Knowledge-based Visual Question Answering—VQA tasks requiring external world knowledge beyond just image recognition
Variational EM: An iterative optimization method in which the E-step approximates the posterior distribution over latent variables and the M-step maximizes the expected log-likelihood under that approximation with respect to the model parameters
Latent Variable: A variable that is not directly observed but is inferred from the observed data (here, the 'rough answer' and 'retrieved knowledge')
Rough Answer: An intermediate output generated by the LM (e.g., a caption or initial guess) used to query the knowledge base
Late Interaction: A retrieval mechanism (as in ColBERT) that encodes the query and document separately and compares their encodings token by token, rather than compressing each into a single vector
PRRecall: Pseudo-Relevance Recall—a metric measuring if retrieved documents contain the ground truth answer string
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
ColBERT: Contextualized Late Interaction over BERT—a neural retrieval model using late interaction
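
The two retrieval terms above (Late Interaction and PRRecall) can be illustrated with a minimal Python sketch. It is not the implementation of ColBERT or of any particular paper: the function names `late_interaction_score` and `pseudo_relevance_recall` are hypothetical, token embeddings are plain lists of floats, token similarity is assumed to be a dot product, and PRRecall is assumed to be averaged over questions using case-insensitive substring matching.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_tokens: List[Vector],
                           doc_tokens: List[Vector]) -> float:
    """Late-interaction relevance score (MaxSim-style).

    For each query token embedding, take its best match among the
    document token embeddings, then sum those maxima over the query.
    Assumption: similarity is a plain dot product.
    """
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def pseudo_relevance_recall(retrieved: List[List[str]],
                            answers: List[List[str]]) -> float:
    """Fraction of questions with an answer string in the retrieved docs.

    retrieved[i] holds the documents retrieved for question i and
    answers[i] holds its ground-truth answer strings.
    Assumption: case-insensitive substring matching.
    """
    if not answers:
        return 0.0
    hits = 0
    for docs, golds in zip(retrieved, answers):
        if any(g.lower() in doc.lower() for doc in docs for g in golds):
            hits += 1
    return hits / len(answers)


# Toy data: two query tokens and two document tokens in 2-D.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[0.5, 0.5], [1.0, 0.0]]
score = late_interaction_score(query, document)

# Toy data: the first question's answer appears in its document,
# the second question's answer does not.
retrieved_docs = [["The Eiffel Tower is in Paris."], ["Dogs are mammals."]]
gold_answers = [["paris"], ["cat"]]
recall = pseudo_relevance_recall(retrieved_docs, gold_answers)
```

Because each query token picks its own best-matching document token, the score keeps token-level detail that a single pooled vector would lose. The recall function is "pseudo" relevance in the sense that it only checks for the answer string, not whether the document actually supports the answer.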