VLMs: Vision-Language Models—AI models capable of processing both images and text to reason about visual content
OCR: Optical Character Recognition—technology that converts images of text, such as scanned paper documents, PDF files, or photos taken with a digital camera, into editable and searchable text
EM: Exact Match—an evaluation metric that counts a prediction as correct only if it is identical to the ground truth answer
BLEU: Bilingual Evaluation Understudy—a metric that scores machine-translated text by its n-gram overlap with one or more reference translations
Visual Noise: Distortions applied to images to mimic real-world imperfections, such as Gaussian blur, salt-and-pepper noise, skew, and JPEG compression
Qwen2.5-VL: A family of large Vision-Language Models developed by Alibaba Cloud
Back-translation: Translating a translated text back into the original language and comparing the result with the original source to check the translation's accuracy
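
As an illustration of the EM entry above, the following is a minimal sketch of how Exact Match could be computed over a set of predictions. The function name `exact_match` and the choice of strict string equality with no normalization (case, whitespace, and punctuation all count) are assumptions made for this example, not a description of any particular evaluation pipeline.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions identical to their reference.

    Assumption: "identical" means strict string equality, so "Paris" and
    "paris" do not match. Real evaluations often normalize text first.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0  # convention chosen here for an empty evaluation set
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["42", "Paris", "cat"]
    refs = ["42", "paris", "cat"]
    # Two of the three predictions are identical to their references.
    print(exact_match(preds, refs))
```

Because the comparison is strict, a prediction that differs only in capitalization is scored as wrong, which is why EM is usually reported alongside softer metrics such as BLEU.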