Plackett-Luce (PL) Ranking: A probability model over rankings in which a permutation is built by choosing items one at a time, each with probability proportional to its 'strength' (or reward) relative to the items not yet ranked
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training a model on labeled examples to establish baseline capabilities
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on the relative performance of a group of outputs
KL-regularization: Kullback-Leibler regularization—a penalty term ensuring the trained policy does not diverge too drastically from a reference model
embedding-anchored selection: A method where the model generates a vector (the anchor) and the system selects the external item (e.g., a tool) whose vector representation is closest to that anchor
GroundingDINO: A vision-language model used for object detection and grounding text concepts in images
OCR: Optical Character Recognition—converting images of text into machine-encoded text
DeepSeek-R1: An expert reasoning model used in this paper to generate rationales for data curation
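
As an illustration of the Plackett-Luce entry above, the following is a minimal sketch of how the probability of one ranking can be computed from item strengths. The function name `plackett_luce_prob` and the strength values are hypothetical and chosen only for this example. The sketch assumes strictly positive strengths and does not reflect how the paper itself parameterizes or trains the model.

```python
from itertools import permutations


def plackett_luce_prob(ranking, strengths):
    """Probability of `ranking` (best item first) under Plackett-Luce.

    At each position, the next item is chosen with probability equal to
    its strength divided by the total strength of the items still unranked.
    Assumes every strength is strictly positive.
    """
    remaining = sum(strengths[item] for item in ranking)
    prob = 1.0
    for item in ranking:
        prob *= strengths[item] / remaining
        remaining -= strengths[item]  # this item is no longer available
    return prob


# Illustrative strengths only (not taken from the paper).
strengths = {"a": 3.0, "b": 2.0, "c": 1.0}

# P(a > b > c) = (3/6) * (2/3) * (1/1)
p_abc = plackett_luce_prob(["a", "b", "c"], strengths)

# The probabilities of all permutations should sum to 1.
total = sum(plackett_luce_prob(list(p), strengths) for p in permutations(strengths))
```

Rankings that place stronger items earlier receive higher probability, which is the sense in which the model depends on relative strength rather than absolute values.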