LMM: Large Multimodal Model—a foundation model capable of processing and reasoning over both text and images (e.g., GPT-4V, Bard)
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
PoT: Program-of-Thought—a prompting strategy where the model generates executable code (e.g., Python) to solve the problem
OCR: Optical Character Recognition—technology that converts text in images into machine-readable text
VQA: Visual Question Answering—the task of answering a natural language question based on the content of an image
Hallucination: A phenomenon where a model generates plausible-sounding but factually incorrect information or detects objects/relationships not present in the input
FQA: Figure Question Answering—answering questions based on statistical plots and charts
MathQA: Math-targeted Question Answering—datasets specifically designed to test mathematical problem solving
GPS: Geometry Problem Solving—tasks involving reasoning about geometric shapes and diagrams
TQA: Textbook Question Answering—tasks derived from educational materials, often requiring domain knowledge
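
The difference between the CoT and PoT entries above can be made concrete with a minimal sketch. Everything here is an assumption made for illustration: the question, the prompt wording, the helper names (`build_cot_prompt`, `build_pot_prompt`, `run_pot_response`), and the convention that the generated program stores its result in a variable named `answer`. No model is called; a hard-coded string stands in for a model response.

```python
# Hypothetical question, used only for illustration.
QUESTION = "A rectangle is 3 cm wide and 5 cm long. What is its area in cm^2?"


def build_cot_prompt(question: str) -> str:
    """CoT: ask for intermediate reasoning steps before the final answer."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer."
    )


def build_pot_prompt(question: str) -> str:
    """PoT: ask for a Python program whose result is the answer."""
    return (
        f"Question: {question}\n"
        "Write Python code that computes the answer and stores it "
        "in a variable named `answer`."
    )


def run_pot_response(program: str):
    """Execute a generated program and return its `answer` variable.

    Calling exec on untrusted model output is unsafe; a real setup
    would run the program in a sandbox. This is only a sketch.
    """
    namespace = {}
    exec(program, namespace)
    return namespace["answer"]


# Stand-in for a model's PoT response (assumed, not real model output).
mock_pot_response = "width = 3\nlength = 5\nanswer = width * length"

cot_prompt = build_cot_prompt(QUESTION)
pot_prompt = build_pot_prompt(QUESTION)
result = run_pot_response(mock_pot_response)
```

With CoT, the final answer has to be read out of free-form text. With PoT, the arithmetic is delegated to the interpreter, so the answer is whatever the executed program computes.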