Attention Distraction (AD): A failure mode where retrieved text suppresses global visual attention and shifts focus from relevant image regions to irrelevant ones
LVLM: Large Vision-Language Model, a model that processes both images and text to generate text responses
Dual-question formulation: MAD-RAG's prompt structure that includes the question twice: one copy placed after the image for grounding, and one after the retrieved context for integration
Visual grounding: The ability of a model to link its textual reasoning to specific, relevant regions in the input image
Attention mixing: A mechanism to linearly combine attention weights from two different sources (image-question and context-question) during decoding
Convex combination: A weighted average whose coefficients are non-negative and sum to 1 (e.g., alpha * A + (1 - alpha) * B with 0 <= alpha <= 1)
Greedy decoding: A generation strategy that selects the highest-probability token at each step
Oracle chunks: High-quality retrieved text segments known to contain relevant information, used to isolate generation failures from retrieval failures
Sink-token effects: A phenomenon in attention mechanisms where specific tokens (like the start token) absorb a disproportionate share of attention even though they carry little semantic content
RAG: Retrieval-Augmented Generation, providing external documents to a model to help it answer knowledge-intensive questions
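
The "Attention mixing" and "Convex combination" entries above can be illustrated with a minimal sketch. It is not MAD-RAG's implementation: the function name `mix_attention`, the example weights, and the choice of alpha = 0.5 are assumptions made for illustration, and the glossary does not specify at which layers or heads the mixing is applied.

```python
from typing import List


def mix_attention(a_img: List[float], a_ctx: List[float], alpha: float) -> List[float]:
    """Element-wise convex combination: alpha * a_img + (1 - alpha) * a_ctx."""
    # A convex combination needs non-negative coefficients that sum to 1,
    # which holds for alpha and (1 - alpha) exactly when 0 <= alpha <= 1.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(a_img) != len(a_ctx):
        raise ValueError("attention vectors must have the same length")
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(a_img, a_ctx)]


# Hypothetical attention weights over the same four key positions.
# Each vector sums to 1, as a softmax-normalized attention row would.
image_question = [0.70, 0.20, 0.05, 0.05]    # image-question source
context_question = [0.10, 0.10, 0.40, 0.40]  # context-question source

mixed = mix_attention(image_question, context_question, alpha=0.5)
print(mixed)
```

Because both inputs sum to 1 and the coefficients form a convex combination, the mixed weights also sum to 1, so the result remains a valid attention distribution without renormalization.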