linearization: The process of converting a 2D document layout (with columns, sidebars, floating figures) into a coherent 1D string of text that follows natural reading order
document-anchoring: A prompting technique where noisy text extracted from the PDF file's internal metadata is provided to the VLM alongside the page image to improve OCR accuracy and reduce hallucinations
VLM: Vision Language Model—a multimodal model capable of processing both images and text
OCR: Optical Character Recognition—the conversion of images of typed or handwritten text into machine-encoded text
pypdf: A Python library used to extract internal structure and metadata from PDF files
SGLang: A high-performance inference engine for large language models and VLMs, used here for efficient batch processing
unit-test: In this paper, a deterministic pass/fail check used for evaluation (e.g., 'Does the output contain the string X?', 'Is string A before string B?')
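NFC format: Normalization Form C, the Unicode normalization form that canonically decomposes and then recomposes characters, so that equivalent sequences (e.g., 'e' followed by a combining acute accent versus the precomposed 'é') share one consistent representation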
LaTeX: A typesetting system commonly used for scientific and mathematical documents; used here as a reference format for math formula tests
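
To make the unit-test and NFC format entries concrete, the following is a minimal Python sketch of the two example checks quoted above ('Does the output contain the string X?' and 'Is string A before string B?'), with both sides normalized to NFC before comparison. The function names `normalize`, `text_present`, and `appears_before` are hypothetical and chosen for illustration; this is not the paper's actual evaluation code.

```python
import unicodedata


def normalize(text: str) -> str:
    """Return text in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


def text_present(output: str, needle: str) -> bool:
    """Pass if `needle` occurs in `output` after NFC normalization."""
    return normalize(needle) in normalize(output)


def appears_before(output: str, first: str, second: str) -> bool:
    """Pass if `first` occurs before `second` in `output` (hypothetical check).

    Both strings must be present; only first occurrences are compared.
    """
    out = normalize(output)
    i = out.find(normalize(first))
    j = out.find(normalize(second))
    return i != -1 and j != -1 and i < j
```

Without the normalization step, a decomposed 'é' in the model output would fail to match a precomposed 'é' in the test string even though the two render identically, which is why a deterministic pass/fail check benefits from a fixed normalization form.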