LLaVA: Large Language-and-Vision Assistant, a multimodal model that connects a vision encoder (such as CLIP) to an LLM via a projector
[CLS] token: A special token in transformer architectures (like BERT/ViT) often used to aggregate global sequence information into a single vector
Token Explosion: The rapid growth in sequence length when multiple images are each tokenized into hundreds or thousands of patch tokens, which can exceed the model's context length and memory budget
LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pre-trained weights and trains small low-rank decomposition matrices added alongside them
SBERT: Sentence-BERT, a modification of the BERT network that uses siamese network structures to derive semantically meaningful sentence embeddings
Self-distillation: A process where a model teaches itself (or a compressed version of itself) using its own predictions as targets
HitRatio@1 (HR@1): Evaluation metric measuring the percentage of test cases where the top recommended item matches the ground truth
Vision Tower: The component of a vision-language model (VLM), usually a ViT, that encodes raw images into feature embeddings
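
Two of the entries above, Token Explosion and HitRatio@1, are quantitative, so a small worked example may help. The sketch below is illustrative only: the function names are hypothetical, and the 336-pixel image size and 14-pixel patch size are assumed values (typical of a CLIP ViT-L/14 vision tower), not figures taken from this document.

```python
from typing import Sequence


def visual_token_count(num_images: int, image_size: int, patch_size: int) -> int:
    """Patch tokens produced for `num_images` square images.

    Assumes non-overlapping square patches and ignores any extra
    tokens such as [CLS]; both are simplifying assumptions.
    """
    patches_per_side = image_size // patch_size
    return num_images * patches_per_side * patches_per_side


def hit_ratio_at_1(top_predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of test cases whose top-ranked item equals the ground truth."""
    if len(top_predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have the same length")
    if not ground_truth:
        return 0.0
    hits = sum(pred == true for pred, true in zip(top_predictions, ground_truth))
    return hits / len(ground_truth)


if __name__ == "__main__":
    # Token explosion: sequence length grows linearly with the number of images.
    print(visual_token_count(1, image_size=336, patch_size=14))   # 576
    print(visual_token_count(10, image_size=336, patch_size=14))  # 5760

    # HR@1: two of the three top recommendations match the ground truth.
    print(hit_ratio_at_1(["a", "b", "c"], ["a", "x", "c"]))
```

Multiply the returned ratio by 100 to report HR@1 as a percentage, as in the definition above.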