MLLM: Multimodal Large Language Model—AI models that accept more than one input modality, typically text and images, and generate text; some can also generate images.
CIDEr: Consensus-based Image Description Evaluation—a metric for image captioning that scores a generated caption by its similarity to a set of human reference captions, weighting n-grams by TF-IDF.
BLEU-4: Bilingual Evaluation Understudy—a precision-based metric that measures the overlap of word n-grams of length 1 through 4 between generated text and reference text, combined as a geometric mean and scaled by a brevity penalty.
QLoRA: Quantized Low-Rank Adaptation—a memory-efficient fine-tuning technique that backpropagates gradients through a frozen, quantized 4-bit pre-trained model into small low-rank adapters.
Zero-shot: Asking the model to perform a task without providing any examples in the prompt.
Few-shot: Providing a small number of examples (e.g., 3 image-caption pairs) in the prompt to guide the model.
LoRA: Low-Rank Adaptation—a technique to fine-tune large models by injecting trainable low-rank matrices into layers while freezing the main weights.
NLLB: No Language Left Behind—a family of multilingual machine translation models from Meta AI; the NLLB-200 release covers 200 languages.
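
To make the BLEU-4 entry above concrete, here is a minimal sketch of a sentence-level score in plain Python. It assumes whitespace tokenization and no smoothing, and the function names `ngrams` and `bleu4` are illustrative rather than taken from any library. Published scores are normally computed at corpus level with an established toolkit, so the numbers from this sketch are not directly comparable.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Sentence-level BLEU-4, unsmoothed (illustrative sketch only)."""
    cand = candidate.split()  # assumption: simple whitespace tokenization
    refs = [r.split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    # Geometric mean of the four modified precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / 4)
```

For example, `bleu4("a cat sits on the mat today", ["a cat sits on the mat"])` combines the precisions 6/7, 5/6, 4/5, and 3/4, whereas a candidate that shares no 4-gram with any reference scores zero under this unsmoothed variant.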