VLM: Vision-Language Model, an AI model that takes images together with text as input and generates text in response
P-probes: Perceptual probes, atomic questions that each test one task-relevant visual attribute (e.g., 'How many red circles?') to verify whether the model 'sees' the facts the puzzle depends on
R-probes: Reasoning probes, text-only questions that state the relevant facts explicitly and ask the model to apply a logical rule, testing reasoning in isolation from perception errors
Abstract puzzles: IQ tasks using geometric primitives (shapes, lines) and formal patterns
Natural puzzles: IQ tasks using real-world objects and scenes while preserving the same logical categories as the abstract puzzles
SFT: Supervised Fine-Tuning, further training of a pretrained model on labeled input-output examples to improve its performance on a target task
Chain-of-Thought: A reasoning technique in which the model generates intermediate steps before giving the final answer
Tinkercad: A web-based CAD platform used here to generate consistent 3D visualization puzzles