System-II: Slow, deliberate, and logical reasoning processes (as opposed to fast, intuitive System-I thinking), which this benchmark aims to evaluate.
MMLU: Massive Multitask Language Understanding—a popular benchmark for general knowledge and reasoning, now considered close to saturation.
MMMU: Massive Multi-discipline Multimodal Understanding and Reasoning, a benchmark considered a standard for evaluating multimodal large language models (MLLMs).
CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer.
o1: OpenAI o1, a large language model trained with reinforcement learning specifically for complex reasoning tasks.
OCR: Optical Character Recognition—technology used to convert images of text into machine-encoded text.
Reasoning Tokens: Internal tokens generated by models like OpenAI o1 during their 'thought process' before outputting a visible response; used here as a proxy for question difficulty.
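
As a rough illustration of the Reasoning Tokens entry, the sketch below shows one way per-question reasoning-token counts could be turned into a difficulty ranking. It assumes the counts have already been recorded for each question. The record fields (`id`, `reasoning_tokens`), the helper names, and the median cutoff are hypothetical choices for this example, not the benchmark's actual procedure, and the numbers are made up.

```python
from statistics import median


def rank_by_reasoning_tokens(records):
    """Return records sorted from most to fewest reasoning tokens."""
    return sorted(records, key=lambda r: r["reasoning_tokens"], reverse=True)


def split_by_difficulty(records):
    """Label a question 'hard' if its count exceeds the median, else 'easy'.

    The median cutoff is an assumption made for this sketch.
    """
    cutoff = median(r["reasoning_tokens"] for r in records)
    return {
        r["id"]: ("hard" if r["reasoning_tokens"] > cutoff else "easy")
        for r in records
    }


# Made-up counts for illustration only; not results from the benchmark.
records = [
    {"id": "q1", "reasoning_tokens": 320},
    {"id": "q2", "reasoning_tokens": 2048},
    {"id": "q3", "reasoning_tokens": 960},
]

ranked_ids = [r["id"] for r in rank_by_reasoning_tokens(records)]
labels = split_by_difficulty(records)
```

Here a higher count is read as "the model needed more internal deliberation", so `ranked_ids` orders questions from hardest to easiest under that proxy. Token counts also vary with question length and model version, so a ranking like this is best treated as a relative signal within one model rather than an absolute difficulty score.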