LMM: Large Multimodal Model, an AI model that processes both text and images as input and, in some cases, also generates images (e.g., GPT-4V and LLaVA accept images and text but output text only)
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
Thinking about Images: The traditional paradigm where an image is encoded once into static features, and all subsequent reasoning is done via text
Thinking with Images: The proposed paradigm where models actively generate, crop, or manipulate images as intermediate steps to support reasoning
SFT: Supervised Fine-Tuning—training a model on labeled examples to follow specific instructions or reasoning patterns
RL: Reinforcement Learning—training models via rewards/penalties to optimize complex behaviors
Visual Chain of Thought: A reasoning sequence where some links in the chain are visual (e.g., a generated diagram) rather than textual
Intrinsic Imagination: The ability of a model to internally generate new visual representations (like mental simulations) without relying on external tools or code