OCR: Optical Character Recognition—technology that converts different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data
VQA: Visual Question Answering—a task where a system is given an image and a question about the image, and must produce an answer
Instruction Tuning: Fine-tuning language models on datasets of (instruction, output) pairs to improve their ability to follow user commands
CLIP: Contrastive Language-Image Pre-training—a neural network trained contrastively on (image, text) pairs so that images and text share one embedding space, which makes it suitable for zero-shot learning
Hallucination: A phenomenon where a model generates content that is nonsensical or unfaithful to the source content (e.g., describing objects not present in the image)
CIDEr: Consensus-based Image Description Evaluation—a metric used to evaluate image captioning quality by comparing generated captions to human reference captions
BLEU: Bilingual Evaluation Understudy—a metric for evaluating machine-translated text by measuring its n-gram precision against one or more human reference translations
ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of recall-focused metrics used to evaluate automatic summarization and machine translation by measuring overlap with reference texts
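METEOR: Metric for Evaluation of Translation with Explicit ORdering—a metric for evaluating machine translation output that aligns words between the candidate and the reference and combines precision and recall with a penalty for fragmented word order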
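
The overlap-based metrics above (BLEU, ROUGE) share one core idea: count the n-grams a candidate has in common with a reference, then normalize by either the candidate length (precision) or the reference length (recall). The sketch below illustrates only that shared building block, assuming whitespace tokenization and a single reference. The function names `clipped_precision` and `ngram_recall` are hypothetical, and the code is not the full BLEU or ROUGE definition (no brevity penalty, no averaging over n-gram orders, no stemming).

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _overlap(candidate, reference, n):
    """Return (matched, candidate_total, reference_total) n-gram counts.

    Assumption: plain whitespace tokenization, one reference string.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count per n-gram, so a
    # repeated candidate n-gram is credited at most as often as it
    # occurs in the reference ("clipping").
    matched = sum((cand & ref).values())
    return matched, sum(cand.values()), sum(ref.values())


def clipped_precision(candidate, reference, n=1):
    """BLEU-style building block: matched n-grams / candidate n-grams."""
    matched, cand_total, _ = _overlap(candidate, reference, n)
    return matched / cand_total if cand_total else 0.0


def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style building block: matched n-grams / reference n-grams."""
    matched, _, ref_total = _overlap(candidate, reference, n)
    return matched / ref_total if ref_total else 0.0


if __name__ == "__main__":
    cand = "the the the cat"
    ref = "the cat sat on the mat"
    print(clipped_precision(cand, ref, n=1))  # 3 of 4 candidate unigrams match
    print(ngram_recall(cand, ref, n=1))       # 3 of 6 reference unigrams match
```

Clipping is what keeps a degenerate candidate such as "the the the cat" from scoring perfect precision: "the" appears twice in the reference, so only two of its three occurrences count.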