
When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

B Zhao, W Deng, X Liao, Y Li, N Shaikh, Y Nie, X Li
University of British Columbia, Vector Institute
arXiv, January 2026
RAG MM QA

📝 Paper Summary

Modularized RAG pipeline
MAD-RAG is a training-free, inference-time method that mitigates "Attention Distraction" in multimodal RAG. It decouples visual grounding from context integration and mixes attention weights between the two so that the model keeps its focus on question-relevant image regions.
Core Problem
Retrieval augmentation in LVLMs causes a failure mode called Attention Distraction (AD), where retrieved text suppresses visual attention globally and shifts focus away from question-relevant image regions.
Why it matters:
  • Existing RAG methods often degrade performance on questions the model answers correctly without retrieval (Closed-book=1, RAG=0, i.e., the closed-book answer is correct but the RAG answer is wrong)
  • Prior solutions focus only on textual calibration or hallucination, overlooking cross-modal dynamics where text dominates visual evidence
  • Even high-quality retrieval can hurt performance if the model's internal attention mechanism misallocates focus due to the long context
Concrete Example: In a VQA task, an LVLM may correctly identify a visual detail (e.g., a specific bird species) when given no retrieved context. Once relevant text is added via RAG, the model's attention shifts to background pixels or other irrelevant regions, and it hallucinates or answers incorrectly even though the retrieved text is correct.
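One way to make the global suppression described above measurable is to compute the share of a query token's attention mass that lands on image tokens, with and without retrieved text. The sketch below is only an illustrative diagnostic, not necessarily the metric used in the paper. The function name `visual_attention_share`, the boolean image-token mask, and all attention values are assumptions invented for this example.

```python
from typing import Sequence


def visual_attention_share(attn: Sequence[float], is_image: Sequence[bool]) -> float:
    """Return the fraction of one query's attention mass that lands on image tokens."""
    if len(attn) != len(is_image):
        raise ValueError("attn and is_image must have the same length")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    return sum(a for a, m in zip(attn, is_image) if m) / total


# Toy numbers, invented for illustration only (not taken from the paper).
# Closed-book: 2 image tokens followed by 2 question tokens.
closed_book_share = visual_attention_share(
    [0.3, 0.3, 0.2, 0.2], [True, True, False, False]
)
# With retrieval: the same tokens plus 2 retrieved-text tokens that absorb attention.
rag_share = visual_attention_share(
    [0.1, 0.1, 0.1, 0.1, 0.3, 0.3], [True, True, False, False, False, False]
)
print(f"closed-book: {closed_book_share:.2f}, with retrieval: {rag_share:.2f}")
```

In this toy case the image share drops once the retrieved tokens are appended, which is the global-suppression half of Attention Distraction. The spatial-misalignment half would need a second mask marking the question-relevant image region.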
Key Novelty
MAD-RAG (Mitigating Attention Distraction)
  • Identifies 'Attention Distraction' as a distinct failure mode where retrieved text suppresses visual attention and misaligns it spatially
  • Decouples inference into two streams via a dual-question prompt: one question attends primarily to the image (grounding), the other integrates context
  • Injects attention weights from the image-focused stream into the context-aware stream during decoding to force the model to maintain visual focus
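The injection step in the last bullet can be pictured as a blend of two attention rows. This summary does not give the exact mixing rule, so the sketch below assumes a simple convex combination on image-token positions followed by renormalization. The function `mix_attention`, the coefficient `alpha`, the assumption that both rows are aligned over the same key positions, and the toy values are all hypothetical and should not be read as the authors' implementation.

```python
from typing import List, Sequence


def mix_attention(
    context_attn: Sequence[float],
    grounding_attn: Sequence[float],
    is_image: Sequence[bool],
    alpha: float = 0.5,
) -> List[float]:
    """Blend image-token weights from the grounding row into the context-aware row.

    ASSUMPTION: a convex combination with weight `alpha` on image positions,
    then renormalization so the row sums to 1. The paper's rule may differ.
    """
    if not (len(context_attn) == len(grounding_attn) == len(is_image)):
        raise ValueError("all inputs must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = [
        (1.0 - alpha) * c + alpha * g if m else c
        for c, g, m in zip(context_attn, grounding_attn, is_image)
    ]
    total = sum(mixed)
    if total <= 0:
        raise ValueError("mixed attention weights must sum to a positive value")
    return [w / total for w in mixed]


# Toy rows (invented): 2 image tokens followed by 2 text tokens.
is_image = [True, True, False, False]
context_row = [0.05, 0.05, 0.45, 0.45]    # context-aware stream, distracted by text
grounding_row = [0.40, 0.40, 0.10, 0.10]  # image-focused stream

mixed_row = mix_attention(context_row, grounding_row, is_image, alpha=0.5)
image_share_before = sum(w for w, m in zip(context_row, is_image) if m)
image_share_after = sum(w for w, m in zip(mixed_row, is_image) if m)
print(f"image share before: {image_share_before:.3f}, after: {image_share_after:.3f}")
```

Setting `alpha` to 0 leaves the context-aware row unchanged, while larger values pull more of the grounding stream's visual focus into it. Text-token weights are only rescaled, never replaced.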
Evaluation Highlights
  • +4.76% to +9.20% absolute accuracy gains over vanilla RAG across OK-VQA, E-VQA, and InfoSeek benchmarks
  • Rectifies up to 74.68% of 'Attention Distraction' failure cases (where closed-book was correct but RAG failed)
  • Outperforms RAG-oriented baselines (CAD, ALFAR) and hallucination-mitigation methods (VCD, DoLa) at roughly 10% additional computational overhead
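The rectification figure above is a ratio over the failure cases defined earlier (closed-book correct, RAG wrong). A minimal sketch of that bookkeeping follows, assuming per-question 0/1 correctness labels for each setting. The function name `rectification_rate` and the toy labels are invented for illustration and are not results from the paper.

```python
from typing import Optional, Sequence


def rectification_rate(
    closed_book: Sequence[int], rag: Sequence[int], method: Sequence[int]
) -> Optional[float]:
    """Fraction of (closed-book correct, RAG wrong) questions the method answers correctly.

    Returns None when there are no such failure cases.
    """
    if not (len(closed_book) == len(rag) == len(method)):
        raise ValueError("all label lists must have the same length")
    # Keep the method's label only for questions where retrieval broke a correct answer.
    failure_case_outcomes = [
        m for cb, r, m in zip(closed_book, rag, method) if cb == 1 and r == 0
    ]
    if not failure_case_outcomes:
        return None
    return sum(failure_case_outcomes) / len(failure_case_outcomes)


# Toy labels (invented): 1 = correct, 0 = wrong, one entry per question.
closed_book = [1, 1, 1, 0, 1]
rag = [0, 0, 1, 0, 0]
method = [1, 0, 1, 0, 1]

rate = rectification_rate(closed_book, rag, method)
print(f"rectified {rate:.2%} of the failure cases")
```

Here questions 1, 2, and 5 are failure cases, and the method recovers two of the three.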
Breakthrough Assessment
8/10
Identifies a fundamental failure mechanism (Attention Distraction) in multimodal RAG and provides a simple, effective, training-free fix that recovers much of the lost performance.