OCR: Optical Character Recognition—technology that converts images of typed, handwritten, or printed text into machine-encoded text
VLM: Vision-Language Model—a model that combines computer vision and natural language processing to understand and generate content based on image and text inputs
LLM: Large Language Model—a deep learning model trained on large text corpora that can recognize, summarize, translate, predict, and generate text
ANLS: Average Normalized Levenshtein Similarity—a metric commonly used in document and scene-text Visual Question Answering that scores each predicted answer by its normalized edit-distance similarity to the ground truth and averages the scores over all questions
CIDEr: Consensus-based Image Description Evaluation—a metric used to evaluate image captioning quality
learnable queries: A fixed number of trainable vectors that act as 'slots' to aggregate information from a larger input source via attention mechanisms
prompt tuning: A technique where a small number of trainable parameters are added to the input prompt while keeping the rest of the model frozen
DocVQA: Document Visual Question Answering—a dataset for evaluating VQA on document images
DUDE: Document Understanding Dataset and Evaluation—a benchmark for multi-page document understanding
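
To make the ANLS entry above concrete, here is a minimal sketch of how the metric is commonly computed. It rests on assumptions that are not stated in this glossary: a threshold of 0.5 below which a match counts (the value usually used), case-insensitive comparison after stripping whitespace, and taking the best score when a question has several accepted answers. The function names `levenshtein` and `anls` are illustrative, not taken from any particular library.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         ground_truths: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity.

    Assumption: threshold=0.5 and lowercase/strip normalization, as is
    common practice; adjust if a benchmark specifies otherwise.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            g = ans.strip().lower()
            longest = max(len(p), len(g))
            # Normalized Levenshtein distance in [0, 1]
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Similarity counts only when the distance is under the threshold
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

An exact match (ignoring case) scores 1, a prediction whose normalized distance reaches the threshold scores 0, and near misses such as small OCR errors receive partial credit in between.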