MLLM: Multimodal Large Language Model—an AI model capable of processing and reasoning over both text and image inputs
CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer
Test-time compute scaling: Techniques that improve model performance at inference time (rather than during training) by spending more computation per query, e.g., sampling several answers and returning the most frequent one (majority voting)
Visual reasoning: The ability to manipulate, analyze, and infer conclusions from visual inputs (e.g., spatial rotation, path tracing), distinct from merely recognizing objects
Organic multimodal reasoning: Reasoning that requires integrating complementary information from both text and vision, where neither modality is sufficient on its own
EMMA: Enhanced MultiModal ReAsoning—the benchmark introduced in this paper
SoTA: State-of-the-Art—the current best performing models or methods
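
To make the majority-voting form of test-time compute scaling concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the procedure used in the paper: the names `normalize`, `majority_vote`, and `sample_and_vote` are hypothetical, `generate` stands in for any function that returns one final answer string per call, and answers are compared after simple whitespace and case normalization.

```python
from collections import Counter
from typing import Callable, List, Optional


def normalize(answer: str) -> str:
    """Canonicalize an answer so that 'B', ' b' and 'b ' count as the same vote."""
    return answer.strip().lower()


def majority_vote(answers: List[str]) -> Optional[str]:
    """Return the most frequent normalized answer, or None if there are no answers.

    Ties are broken in favor of the answer that appeared first.
    """
    if not answers:
        return None
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][0]


def sample_and_vote(generate: Callable[[str], str], prompt: str, n: int) -> Optional[str]:
    """Call a (hypothetical) model function n times and vote over its answers.

    `generate` is an assumption: any callable that maps a prompt to one
    final answer string. Larger n means more inference-time compute.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [generate(prompt) for _ in range(n)]
    return majority_vote(answers)


if __name__ == "__main__":
    # Canned answers standing in for five sampled model responses.
    sampled = ["B", " b", "A", "C", "B"]
    print(majority_vote(sampled))
```

Raising `n` is the scaling knob here: each extra sample costs one more model call, and the vote only helps when the correct answer is the single most common one among the samples.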