When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

📝 Paper Summary

Modularized RAG pipeline

MAD-RAG is a training-free inference method that fixes "Attention Distraction" in multimodal RAG by decoupling visual grounding from context integration and mixing attention weights to preserve focus on relevant image regions.

Core Problem

Retrieval augmentation in LVLMs causes a failure mode called Attention Distraction (AD), where retrieved text suppresses visual attention globally and shifts focus away from question-relevant image regions.

Why it matters:

Existing RAG methods often degrade performance on questions the model could originally answer correctly without retrieval (Closed-book=1, RAG=0)
Prior solutions focus only on textual calibration or hallucination, overlooking cross-modal dynamics where text dominates visual evidence
Even high-quality retrieval can hurt performance if the model's internal attention mechanism misallocates focus due to the long context

Concrete Example: In a VQA task, an LVLM might correctly identify a visual detail (e.g., a specific bird species) without context. When relevant text is added via RAG, the model's attention shifts to background pixels or irrelevant regions, causing it to hallucinate or answer incorrectly despite having the correct text.
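
One way to make the "suppresses visual attention globally" part of this failure mode concrete is to measure how much of a query token's attention lands on image tokens with and without retrieved text. The sketch below is illustrative only: the function name, the token layout, and the toy numbers are assumptions made for this example, not the paper's diagnostic code.

```python
def visual_attention_mass(attn_row, image_span):
    """Fraction of one query token's attention that lands on image tokens.

    attn_row   -- attention weights over all key positions (non-negative)
    image_span -- half-open (start, end) index range of the image tokens
                  (the layout is an assumption of this sketch)
    """
    start, end = image_span
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[start:end]) / total


# Toy attention rows, invented for illustration only.
# Keys, closed-book: [BOS, img, img, img, question]
closed_book = [0.10, 0.30, 0.30, 0.20, 0.10]
# Keys, with RAG:    [BOS, img, img, img, question, ctx, ctx]
with_rag = [0.10, 0.05, 0.05, 0.05, 0.05, 0.35, 0.35]

mass_closed = visual_attention_mass(closed_book, (1, 4))
mass_rag = visual_attention_mass(with_rag, (1, 4))
print(f"closed-book: {mass_closed:.2f}, with RAG: {mass_rag:.2f}")
```

A drop in this ratio once context is appended captures only the global suppression; the spatial shift toward irrelevant regions would need a per-region comparison of the image-token weights.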

Key Novelty

MAD-RAG (Mitigating Attention Distraction)

Identifies 'Attention Distraction' as a distinct failure mode where retrieved text suppresses visual attention and misaligns it spatially
Decouples inference into two streams via a dual-question prompt: one question attends primarily to the image (grounding), the other integrates context
Injects attention weights from the image-focused stream into the context-aware stream during decoding to force the model to maintain visual focus

Evaluation Highlights

+4.76 to +9.20 percentage-point absolute accuracy gains over vanilla RAG across the OK-VQA, E-VQA, and InfoSeek benchmarks (LLaVA-1.5-13B)
Rectifies up to 74.68% of 'Attention Distraction' failure cases (where closed-book was correct but RAG failed)
Outperforms RAG-oriented baselines (CAD, ALFAR) and hallucination-mitigation methods (VCD, DoLa) at a small inference latency overhead (roughly 9-11%)

Breakthrough Assessment

8/10

Identifies a fundamental failure mechanism (Attention Distraction) in multimodal RAG and provides a simple, effective, training-free fix that recovers much of the lost performance.

⚙️ Technical Details

Problem Definition

Setting: Knowledge-based Visual Question Answering (KB-VQA) with Retrieval-Augmented Generation

Inputs: Image I, Question Q, Retrieved Context C

Outputs: Generated textual answer

Pipeline Flow

Input Construction (Dual-Question Prompting)
Forward Pass with Attention Intervention
Generation

System Modules

Input Constructor

Create dual-question prompt sequence: [Image, Q_I, Context, Q_C]

Model or implementation: Prompt Engineering
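
The dual-question layout can be sketched as a small helper that concatenates the four segments and records where each one sits, since the attention intervention needs to know which positions belong to the image, Q_I, and Q_C. This is a minimal sketch: `build_dual_question_segments` and the placeholder tokens are hypothetical, and a real chat template would add system and role tokens that are not modeled here.

```python
def build_dual_question_segments(image_tokens, question_tokens, context_tokens):
    """Return the token sequence and the index span of each segment.

    Layout follows the summary: [Image, Q_I, Context, Q_C], where Q_I and Q_C
    are two copies of the same question.
    """
    segments = [
        ("image", image_tokens),
        ("q_i", question_tokens),      # question copy used for visual grounding
        ("context", context_tokens),
        ("q_c", question_tokens),      # question copy used for context integration
    ]
    sequence, spans = [], {}
    for name, tokens in segments:
        start = len(sequence)
        sequence.extend(tokens)
        spans[name] = (start, len(sequence))   # half-open [start, end)
    return sequence, spans


# Placeholder tokens, invented for illustration only.
sequence, spans = build_dual_question_segments(
    image_tokens=["<img>"] * 4,
    question_tokens=["what", "bird", "?"],
    context_tokens=["retrieved", "passage"],
)
print(spans)
```

Because Q_I precedes the context, a causal model cannot attend to the retrieved text while processing it, which is what lets that copy serve as the image-focused stream.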

Attention Mixer

Intervene in self-attention layers to blend visual attention from Q_I into Q_C

Model or implementation: Mathematical Operation (Attention Manipulation)
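
A minimal sketch of the blending step for a single attention row follows. The summary does not specify which stream receives the weight alpha, whether the row is renormalized afterward, or which layers and heads are modified, so those choices are assumptions of this example, and `mix_visual_attention` is a hypothetical name.

```python
def mix_visual_attention(attn_qc, attn_qi, image_span, alpha=0.5):
    """Blend Q_I's attention over image tokens into a Q_C attention row.

    attn_qc, attn_qi -- attention rows indexed by the same key positions
    image_span       -- half-open (start, end) range of image-token keys
    alpha            -- weight on the Q_I (image-grounded) pattern; which
                        stream gets alpha is an assumption of this sketch
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    start, end = image_span
    mixed = list(attn_qc)
    for k in range(start, end):
        # Convex combination on image-token keys only.
        mixed[k] = alpha * attn_qi[k] + (1.0 - alpha) * attn_qc[k]
    total = sum(mixed)
    # Renormalize so the row sums to 1 again (assumption, not from the paper).
    return [w / total for w in mixed] if total > 0 else mixed


# Toy rows, invented for illustration. Keys: [img, img, q_i, ctx, ctx]
attn_qi = [0.6, 0.3, 0.1, 0.0, 0.0]   # Q_I cannot see the later context
attn_qc = [0.1, 0.1, 0.0, 0.4, 0.4]   # Q_C is dominated by context tokens
mixed = mix_visual_attention(attn_qc, attn_qi, image_span=(0, 2), alpha=0.5)
print([round(w, 3) for w in mixed])
```

With these toy rows the share of attention on image keys rises after mixing, while setting alpha to 0 leaves the Q_C row unchanged.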

Novel Architectural Elements

Dual-question prompt structure [I, Q_I, C, Q_C] specifically designed to separate visual grounding (Q_I) from context integration (Q_C)
Layer-wise attention mixing mechanism that explicitly injects image-grounded attention patterns from Q_I into the reasoning process of Q_C

Modeling

Base Model: Evaluated on LLaVA-1.5 (7B, 13B) and Qwen2.5-VL (3B, 7B)

Key Hyperparameters:

alpha: 0.5 (Attention mixing weight)

Compute: Small inference overhead (9-11% latency increase)

Comparison to Prior Work

vs. CAD/AdaCAD: MAD-RAG intervenes on internal attention dynamics rather than output logits
vs. ALFAR: MAD-RAG promotes visual attention to fix distraction, whereas ALFAR assumes visual dominance and promotes context attention
vs. DoLa/VCD: MAD-RAG is specifically designed for RAG contexts where long text suppresses visual evidence, which general hallucination methods fail to address
vs. AlignRAG: MAD-RAG is training-free and plug-and-play

Limitations

Relies on the assumption that the 'Image-Question' (Q_I) captures correct visual grounding
Introduces a small latency overhead (approx. 10%) due to processing a longer effective sequence
Performance depends on the mixing hyperparameter alpha, though it is relatively robust

📊 Experiments & Results

Evaluation Setup

Knowledge-based VQA with both Oracle and CLIP-based retrieval

Benchmarks:

OK-VQA (Knowledge-based VQA)
E-VQA (Encyclopedic VQA)
InfoSeek (Fine-grained Entity VQA)

Metrics:

Exact Match Accuracy
Statistical methodology: Not explicitly reported in the paper
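
For readers unfamiliar with the metric, exact match accuracy can be sketched as below. The normalization steps (lowercasing, stripping punctuation and articles) and the multi-reference handling are assumptions of this example; each benchmark's official scoring script may differ.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(predictions, references):
    """Percent of predictions that match any acceptable reference answer.

    predictions -- list of predicted answer strings
    references  -- list of lists of acceptable answers, one list per question
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = sum(
        normalize(pred) in {normalize(ref) for ref in refs}
        for pred, refs in zip(predictions, references)
    )
    return 100.0 * hits / len(predictions)


# Toy answers, invented for illustration only.
score = exact_match_accuracy(
    ["The Robin.", "sparrow"],
    [["robin"], ["eagle", "hawk"]],
)
print(score)
```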

Key Results

MAD-RAG consistently outperforms vanilla RAG and baselines across all datasets using LLaVA-1.5-13B.

Benchmark	Metric	Baseline	This Paper	Δ
OK-VQA	Accuracy	60.46	65.22	+4.76
E-VQA	Accuracy	58.40	67.60	+9.20
InfoSeek	Accuracy	29.95	36.13	+6.18
OK-VQA	Accuracy	61.35	65.22	+3.87

MAD-RAG specifically recovers performance on 'Attention Distraction' cases where Closed-Book was correct but RAG failed.

Benchmark	Metric	Baseline	This Paper	Δ
OK-VQA	Recovery Rate (Quadrant: Closed=1, RAG=0)	0.0	74.68	+74.68
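
The recovery rate is computed over the quadrant where the closed-book answer was correct but the vanilla RAG answer was not. A minimal sketch of that bookkeeping, with a hypothetical function name and invented toy labels, could look like this:

```python
def recovery_rate(closed_book_correct, rag_correct, method_correct):
    """Percent of (closed-book right, RAG wrong) cases the method answers correctly.

    Each argument is a list of per-question correctness flags (bool or 0/1),
    aligned by question index.
    """
    distracted = [
        m
        for cb, rag, m in zip(closed_book_correct, rag_correct, method_correct)
        if cb and not rag
    ]
    if not distracted:
        return 0.0
    return 100.0 * sum(distracted) / len(distracted)


# Toy correctness flags, invented for illustration only.
closed_book = [1, 1, 1, 0, 1]
vanilla_rag = [0, 0, 1, 0, 0]
method      = [1, 0, 1, 0, 1]

print(recovery_rate(closed_book, vanilla_rag, method))
print(recovery_rate(closed_book, vanilla_rag, vanilla_rag))
```

Scoring vanilla RAG against its own failure quadrant always yields 0, which is why the baseline column for this metric is 0.0 by construction.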

Experiment Figures

Attention heatmaps comparing Closed-book, Vanilla RAG, and MAD-RAG.

Performance breakdown by success/failure quadrants (Closed-book vs. RAG).

Main Takeaways

MAD-RAG consistently outperforms existing baselines (CAD, ALFAR, etc.) across different model families (LLaVA, Qwen) and datasets.
The method is robust to context length, maintaining gains even as retrieved context grows, unlike baselines that degrade.
It effectively mitigates 'Attention Distraction' by recovering correct answers that were suppressed by retrieval, without sacrificing performance on cases where RAG was already helpful.

📚 Prerequisite Knowledge

Prerequisites

Transformer attention mechanisms (Query, Key, Value)
Retrieval-Augmented Generation (RAG)
Large Vision-Language Models (LVLMs)
Autoregressive decoding

Key Terms

Attention Distraction (AD): A failure mode where retrieved text suppresses visual attention globally and shifts focus from question-relevant image regions to irrelevant ones

LVLM: Large Vision-Language Model—a model capable of processing both images and text to generate text responses

Dual-question formulation: MAD-RAG's prompt structure, which duplicates the question: one copy is placed after the image for grounding, the other after the context for integration

Visual grounding: The ability of a model to link its textual reasoning to specific, relevant regions in the input image

Attention mixing: A mechanism to linearly combine attention weights from two different sources (image-question and context-question) during decoding

Convex combination: A weighted average where coefficients sum to 1 (e.g., alpha * A + (1-alpha) * B)

Greedy decoding: A generation strategy that selects the highest probability token at each step

Oracle chunks: High-quality retrieved text segments known to contain relevant information, used to isolate generation failures from retrieval failures

Sink-token effects: A phenomenon in attention mechanisms where specific tokens (like the start token) absorb a disproportionate amount of attention without semantic meaning

RAG: Retrieval-Augmented Generation—providing external documents to a model to help it answer knowledge-intensive questions