RAG: Retrieval-Augmented Generation—systems that fetch external data to improve AI responses
LVLM: Large Vision-Language Model—AI models capable of understanding both text and images (e.g., GPT-4o, Qwen-VL)
Set of Marks (SoM): A prompting technique where objects in an image are overlaid with visible numeric markers to help the model reference specific regions
CIDEr: Consensus-based Image Description Evaluation—a metric for image captioning that scores a candidate caption by its TF-IDF-weighted n-gram similarity to a set of reference captions
BLEU: Bilingual Evaluation Understudy—a precision-based metric measuring n-gram overlap between generated and reference text, often used for translation and captioning
METEOR: Metric for Evaluation of Translation with Explicit ORdering—a text evaluation metric that matches words by exact form, stem, and synonym and combines precision with recall, which tends to correlate better with human judgment than BLEU
Visual-RAG Alignment: The paper's proposed method of mapping retrieved text segments to specific visual regions using spatial prompts and markers
mRAG: A baseline multimodal RAG approach that typically uses image-to-text retrieval or simple query-based retrieval
KAC-dataset: Knowledge-Augmented Captioning dataset—a new benchmark introduced in this paper covering diverse domains like cultural relics and products
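
As an illustration of the n-gram overlap that BLEU builds on, the following is a minimal sketch of clipped (modified) n-gram precision. It is not the full BLEU score: the brevity penalty and the geometric mean over n-gram orders are left out, tokenization is plain whitespace splitting, and the function names `ngrams` and `clipped_precision` are hypothetical, chosen only for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against reference texts.

    Each candidate n-gram is credited at most as many times as it
    appears in the single reference where it is most frequent.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        # Candidate is shorter than n tokens: no n-grams to score.
        return 0.0

    # Highest count of each n-gram across the individual references.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)

    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


if __name__ == "__main__":
    refs = ["the cat is on the mat", "there is a cat on the mat"]
    print(clipped_precision("the the the the the the the", refs, 1))
```

Clipping is what keeps a degenerate caption that repeats one common word from scoring well: in the usage example, "the" occurs seven times in the candidate but is credited only twice, the most it appears in any single reference.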