VLM: Vision-Language Model—a model capable of processing and understanding both image and text inputs
OCR: Optical Character Recognition—conversion of images of typed, handwritten, or printed text into machine-encoded text
RAG: Retrieval-Augmented Generation—systems that improve large language model (LLM) outputs by retrieving relevant content from external knowledge bases and supplying it as context
RT-DETR: Real-Time DEtection TRansformer—an efficient object detection architecture used here as the core detector for layout analysis
NaViT: Native Resolution Vision Transformer—a visual encoder that processes images at their native resolution and aspect ratio to avoid resizing artifacts
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that scores each sampled output relative to the other outputs sampled for the same prompt, rather than against a learned value model; used here to align output styles
SFT: Supervised Fine-Tuning—training a model on labeled datasets to specialize it for specific tasks
Text Spotting: The task of simultaneously detecting the location of text and recognizing its content
Mask-based detection: Predicting pixel-level shapes (masks) rather than just rectangular boxes, essential for non-rectangular (warped) elements
Global Pointer Mechanism: A technique used here to predict the reading order by modeling the precedence relationships between document elements
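
As an illustration of how pairwise precedence relationships can be turned into a reading order, the sketch below ranks elements by how many other elements each one is predicted to precede. It is a simplified stand-in, not the Global Pointer Mechanism itself: the function name `reading_order`, the `precedence` matrix layout, and the win-counting decoding rule are all assumptions made for this example, and the network that would produce the scores is not shown.

```python
from typing import List, Sequence


def reading_order(precedence: Sequence[Sequence[float]]) -> List[int]:
    """Derive an ordering from pairwise precedence scores.

    precedence[i][j] is a hypothetical score that element i is read
    before element j. Diagonal entries are ignored.
    """
    n = len(precedence)
    # Count, for each element, how many other elements it is predicted
    # to precede (i "wins" against j when its score is the larger one).
    wins = [
        sum(
            1
            for j in range(n)
            if j != i and precedence[i][j] > precedence[j][i]
        )
        for i in range(n)
    ]
    # Elements that precede more of the others come earlier;
    # ties fall back to the original index.
    return sorted(range(n), key=lambda i: (-wins[i], i))


# Toy scores for three elements (assumed values, for illustration only).
scores = [
    [0.0, 0.9, 0.8],  # element 0 likely precedes elements 1 and 2
    [0.1, 0.0, 0.3],  # element 1 likely precedes nothing
    [0.2, 0.7, 0.0],  # element 2 likely precedes element 1
]
print(reading_order(scores))  # [0, 2, 1]
```

Counting wins is only one way to decode such a matrix; it stays well defined even when the predicted relationships are not perfectly consistent with each other.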