PRM: Process Reward Model, a critic model that scores each individual step of a reasoning chain rather than just the final outcome
MLLM: Multimodal Large Language Model, an AI model capable of processing and reasoning over both text and images
BoN: Best-of-N, an evaluation strategy where the model generates N candidate solutions and a critic selects the best one
TTS: Test-Time Scaling, methods that improve model performance at inference time (not training) by spending more compute, e.g., generating more candidates
ORM: Outcome Reward Model, a critic model that assigns a single score to the entire completed response
Monte Carlo sampling: A method used here to estimate step correctness by generating multiple continuations from a given step and taking the fraction of them that reach a correct final answer
VisualPRM400K: The dataset constructed in this paper containing ~400K multimodal problems with step-level correctness labels
VisualProcessBench: The benchmark proposed in this paper containing human-annotated step-wise correctness labels for evaluation
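
To make the Monte Carlo sampling, PRM, and BoN entries above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the names `mc_step_score`, `prm_response_score`, and `best_of_n` are hypothetical, the `rollout` and `step_scorer` callables stand in for a real model and critic, and averaging step scores into one response score is just one possible aggregation choice.

```python
from statistics import mean
from typing import Callable, List, Sequence

Step = str  # a reasoning step, represented as plain text for illustration


def mc_step_score(prefix: Sequence[Step],
                  rollout: Callable[[Sequence[Step]], bool],
                  num_rollouts: int = 16) -> float:
    """Monte Carlo estimate of how good the last step in `prefix` is.

    `rollout` (hypothetical) samples one continuation from the prefix and
    returns True if that continuation ends in a correct final answer.
    The estimate is the fraction of successful continuations.
    """
    if num_rollouts <= 0:
        raise ValueError("num_rollouts must be positive")
    successes = sum(bool(rollout(prefix)) for _ in range(num_rollouts))
    return successes / num_rollouts


def prm_response_score(steps: Sequence[Step],
                       step_scorer: Callable[[Sequence[Step]], float]) -> float:
    """Score a full response with a step-level critic (PRM-style).

    Each step is scored given the steps before it; the mean over steps is
    an assumed aggregation, chosen here only for simplicity.
    """
    return mean(step_scorer(steps[: i + 1]) for i in range(len(steps)))


def best_of_n(candidates: Sequence[List[Step]],
              response_scorer: Callable[[List[Step]], float]) -> List[Step]:
    """Best-of-N: return the candidate the critic scores highest."""
    return max(candidates, key=response_scorer)
```

An ORM fits the same `best_of_n` interface: it would be a `response_scorer` that looks at the whole response once instead of aggregating per-step scores.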