LLaVA: Large Language-and-Vision Assistant—a popular open-source framework for training visual instruction-following models
Phi-2: A highly capable small language model (2.7B parameters) from Microsoft, trained on textbook-quality data
CLIP: Contrastive Language-Image Pre-training—a model that learns to associate images and text, used here as the vision encoder
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs to follow instructions
MLP: Multilayer Perceptron, a small feed-forward neural network (a few stacked linear layers with nonlinearities) used here to project visual features into the language model's embedding space
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique (mentioned as future work)
ScienceQA: A benchmark dataset of multiple-choice science questions, many paired with images, annotated with lectures and explanations
Hallucination: When a model generates plausible but incorrect or non-existent information (e.g., describing objects not present in the image)
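
To make the MLP entry concrete, the sketch below shows one way such a projector might look. It is a minimal illustration, not the implementation used by LLaVA or by this project: the two-layer design with a GELU activation, the class name MLPProjector, and all three sizes (1024-wide vision features, 2560-wide language-model embeddings, 576 image patches) are assumptions chosen for the example, and the weights are random rather than trained.

```python
import numpy as np

# Assumed sizes for illustration only; the glossary does not state them.
VISION_DIM = 1024   # assumed width of each CLIP patch feature
HIDDEN_DIM = 2560   # assumed width of the language model's token embeddings
NUM_PATCHES = 576   # assumed number of image patches per image


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class MLPProjector:
    """Hypothetical two-layer MLP mapping vision features to LM embedding space."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random, untrained weights; a real projector would learn these.
        self.w1 = rng.normal(0.0, 0.02, size=(in_dim, out_dim))
        self.b1 = np.zeros(out_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(out_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, patch_features):
        # Linear -> GELU -> Linear, applied independently to every patch.
        hidden = gelu(patch_features @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


# Stand-in for vision-encoder output: one feature vector per image patch.
patch_features = np.random.default_rng(1).normal(size=(NUM_PATCHES, VISION_DIM))

projector = MLPProjector(VISION_DIM, HIDDEN_DIM)
visual_tokens = projector(patch_features)

# Each patch now has the same width as a text token embedding.
print(visual_tokens.shape)
```

The point of the projection is that each image patch ends up with the same width as a text token embedding, so the language model can consume the projected patches alongside ordinary text tokens.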