MLLM: Multi-modal Large Language Model, an AI model capable of processing and reasoning over both text and image inputs
CoT: Chain-of-Thought—a reasoning technique where the model generates intermediate steps before the final answer
Descriptive Information (DI): Textual content that explicitly describes observable elements in the diagram (e.g., 'There is a circle')
Implicit Property (IP): Geometric or spatial properties that require visual perception to identify (e.g., 'Lines AB and CD are parallel')
Essential Condition (EC): Specific numerical or algebraic measurements crucial for solving the problem (e.g., 'Length = 5')
GPT-4V: GPT-4 with Vision—a version of GPT-4 capable of analyzing images
Text-dominant: A problem version containing the full text, including content that is redundant with the diagram, minimizing the need to look at the diagram
Vision-only: A problem version where the text is minimized, forcing the model to extract almost all information from the diagram
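
To make the relationship between the three text categories (DI, IP, EC) and the two problem versions concrete, here is a minimal Python sketch. The `ProblemText` class, its field names, and both builder functions are hypothetical and introduced only for illustration; in particular, treating the Vision-only text as an empty string is an assumption, since the definition above says only that the text is minimized.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProblemText:
    """Hypothetical container for one problem's text, split by category."""
    descriptive_information: List[str] = field(default_factory=list)  # DI
    implicit_properties: List[str] = field(default_factory=list)      # IP
    essential_conditions: List[str] = field(default_factory=list)     # EC
    question: str = ""


def text_dominant(p: ProblemText) -> str:
    """Keep every category, so the text alone carries nearly all information."""
    parts = (
        p.descriptive_information
        + p.implicit_properties
        + p.essential_conditions
        + [p.question]
    )
    return " ".join(parts)


def vision_only(p: ProblemText) -> str:
    """Assumption: no text is passed, so the diagram must carry everything."""
    return ""


# Example built from the sample phrases in the definitions above.
example = ProblemText(
    descriptive_information=["There is a circle."],
    implicit_properties=["Lines AB and CD are parallel."],
    essential_conditions=["Length = 5."],
    question="Find the area.",
)
```

Calling `text_dominant(example)` joins all four pieces into one prompt string, while `vision_only(example)` returns no text, leaving the model to read everything from the image.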