RLVR: Reinforcement Learning with Verifiable Rewards—training models using objective, binary feedback (pass/fail) rather than a learned reward model
GRPO: Group Relative Policy Optimization, an RL algorithm that samples a group of outputs for the same input and scores each output against the group's average reward, which serves as the baseline in place of a learned value function
Unit Tests: Specific, executable checks generated from ground truth (e.g., 'text exists', 'math renders correctly') used as binary reward signals
Souping: Model Souping—averaging the weights of multiple fine-tuned models (often trained with different seeds) to improve robustness and performance
KaTeX: A fast math typesetting library for the web; used here to verify whether OCR'd LaTeX equations render visually identically to the ground truth
VLM: Vision Language Model—a multimodal AI model capable of processing both image and text inputs
SFT: Supervised Fine-Tuning—the initial phase of training a model on labeled examples before applying reinforcement learning
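
To make the RLVR, Unit Tests, and GRPO entries concrete, here is a minimal sketch of how binary test outcomes could become group-relative advantages. The function names, the pass-rate aggregation, and the standard-deviation normalization are assumptions chosen for illustration; they are not taken from any particular training codebase.

```python
from statistics import mean, pstdev


def reward_from_tests(test_results):
    """Turn a list of binary unit-test outcomes into one scalar reward.

    Assumption: the reward is the pass rate. Other aggregations are possible.
    """
    return sum(test_results) / len(test_results)


def group_relative_advantages(rewards, eps=1e-6):
    """Score each output relative to the other outputs sampled for the same input.

    The group mean acts as the baseline; dividing by the group's standard
    deviation (plus eps, to avoid division by zero) is a common normalization.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical group of four outputs for one input, each checked by three tests.
group_tests = [
    [True, True, True],
    [True, False, False],
    [True, True, False],
    [False, False, False],
]
rewards = [reward_from_tests(t) for t in group_tests]
advantages = group_relative_advantages(rewards)
```

Outputs that pass more tests than the group average receive a positive advantage and are reinforced; outputs below the average receive a negative one. If every output in the group earns the same reward, all advantages are zero and that group contributes no learning signal.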