SFT: Supervised Fine-Tuning—training a pre-trained model on a labeled dataset to adapt it to specific tasks
CoT: Chain-of-Thought—a reasoning method where the model generates intermediate steps before the final answer
T2I: Text-to-Image—generation tasks where a model creates an image based on a textual description
BLEU: Bilingual Evaluation Understudy, a metric that scores generated text by its n-gram overlap with one or more reference texts
CLIPScore: A metric that measures how well an image matches a text caption, computed from the cosine similarity between their CLIP embeddings
DeepSeek-R1: A series of reasoning-oriented Large Language Models known for strong performance in logic and mathematics
Visual Text Images: Images containing significant textual information, such as posters, user interfaces (UI), and textbook pages
Structured Images: Images representing structured data, such as geometric diagrams, mathematical equations, tables, and charts
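
As a rough illustration of the n-gram matching behind the BLEU entry above, the sketch below computes the clipped (modified) n-gram precision for one candidate against one reference. The function names `ngrams` and `modified_precision` are hypothetical and chosen for this example, and whitespace tokenization is assumed. Full BLEU additionally combines the precisions for n = 1 to 4 with a geometric mean and applies a brevity penalty, which this sketch omits.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against a single reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference, so repeating a matching word does not
    inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        # Candidate is shorter than n tokens: no n-grams to match.
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


if __name__ == "__main__":
    # Assumed toy inputs, tokenized on whitespace.
    reference = "the cat is on the mat".split()
    candidate = "the cat sat on the mat".split()
    print(modified_precision(candidate, reference, 1))
    print(modified_precision(candidate, reference, 2))
```

Clipping is the design choice that matters here: a degenerate candidate such as "the the the the" is credited for "the" only twice, because the reference contains it twice.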