LVLM: Large Vision-Language Model—a model capable of processing both images and text to generate text outputs.
DoT: Decomposition-of-Thought—a prompting strategy that breaks a complex question into sequential sub-questions, specifically separating visual perception from logical reasoning.
CoT: Chain-of-Thought—a prompting method that encourages models to generate intermediate reasoning steps before the final answer.
Grounding: The process of linking abstract linguistic concepts (e.g., 'the highest bar') to specific visual regions or features in an image.
Graphical Perception: The visual decoding process humans use to interpret charts, involving tasks like estimating length, position, or angle.
Visual Primitives: Basic visual attributes such as color, shape, spatial coordinates, and length that constitute complex visualizations.
InternVL: The LVLM used as the backbone for fine-tuning in this paper.
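
To make the DoT entry concrete, the following is a minimal sketch of the perception-then-reasoning ordering it describes, not the paper's implementation. The names `run_dot`, `ask`, `perception_questions`, and `reasoning_question` are hypothetical, and `ask` stands in for any call that sends a prompt (plus the chart image) to an LVLM and returns its text answer. The Q/A prompt layout is likewise an assumption.

```python
from typing import Callable, List


def run_dot(ask: Callable[[str], str],
            perception_questions: List[str],
            reasoning_question: str) -> str:
    """Answer perception sub-questions first, then reason over their answers.

    `ask` is a hypothetical stand-in for an LVLM call (prompt in, text out).
    """
    facts = []
    for question in perception_questions:
        # Perception stage: each sub-question reads something off the image.
        answer = ask(question)
        facts.append(f"Q: {question}\nA: {answer}")

    # Reasoning stage: the final prompt carries the extracted facts as context,
    # so the logical step is separated from visual decoding.
    context = "\n".join(facts)
    final_prompt = f"{context}\nQ: {reasoning_question}\nA:"
    return ask(final_prompt)
```

By contrast, a CoT prompt would pose the original question once and ask the model to produce its intermediate steps in a single response. The sketch above instead issues one call per sub-question and passes earlier answers forward.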