
Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment

Luigi Medrano, Arush Verma, Mukul Chhabra
Dell Technologies
arXiv, March 2026
RAG Benchmark

📝 Paper Summary

Modularized RAG pipeline, Industrial/Enterprise RAG
Retrieval fusion increases upstream recall but fails to improve end-to-end answer accuracy in production RAG systems due to downstream reranking saturation and context window truncation.
Core Problem
Retrieval fusion techniques (such as multi-query retrieval combined with Reciprocal Rank Fusion, RRF) are often adopted to boost recall, but their effectiveness is rarely evaluated under strict production constraints such as fixed reranking budgets and latency limits.
Why it matters:
  • Enterprise RAG systems operate under tight latency and cost constraints, making efficiency critical
  • Higher recall at the retrieval stage is meaningless if the relevant documents are discarded during reranking or truncation before reaching the LLM
  • Engineers need to know whether the complexity and latency overhead of fusion are justified by actual downstream gains
Concrete Example: A user asks a short, ambiguous support question. A fusion system generates a reformulation that retrieves 15 new documents. Because the reranker accepts only the top 10 candidates, and the reformulation contributes redundant or slightly off-topic chunks, the document that originally contained the correct answer is pushed out of the final top-10 context.
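The displacement described in the example can be reproduced with a minimal sketch of RRF under a fixed candidate budget. Everything here is illustrative: the document IDs, the budget of 3, and the smoothing constant k=60 (a commonly used default) are assumptions, not values taken from the paper's pipeline.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy data: "gold" is the only document with the correct answer.
original = ["gold", "a", "b"]                        # single-query retrieval
reformulations = [["x", "y", "z"], ["x", "y", "a"]]  # rewritten-query retrievals
budget = 3                                           # assumed fixed reranker budget

baseline_top = original[:budget]
fused = rrf_fuse([original] + reformulations)
fused_top = fused[:budget]

print("gold" in baseline_top)  # gold survives without fusion
print("gold" in fused_top)     # gold is displaced after fusion
```

In this toy case the fused candidate pool is larger (six documents instead of three), yet the documents that the two reformulations agree on accumulate score and occupy the whole budget, so the gold document never reaches the next stage.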
Key Novelty
Production-Constrained Evaluation of RAG Fusion
  • Investigates the 'funnel effect' in RAG pipelines: tracking whether recall gains from fusion actually survive the bottlenecks of reranking and context truncation
  • Demonstrates that fusion often introduces redundancy (near-duplicate chunks) rather than diverse information, which neutralizes benefits when the context window is fixed
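One way to make the funnel effect and the redundancy claim measurable is sketched below. This is not the paper's instrumentation: the stage names, the token-level Jaccard similarity, and the 0.75 threshold are all assumptions chosen for illustration.

```python
def stage_recall(stages, relevant):
    """Fraction of relevant doc ids still present at each named pipeline stage."""
    relevant = set(relevant)
    return {name: len(relevant & set(ids)) / len(relevant)
            for name, ids in stages.items()}


def jaccard(a, b):
    """Token-level Jaccard similarity between two text chunks."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def near_duplicate_fraction(chunks, threshold=0.75):
    """Fraction of chunks that closely match some earlier chunk in the list."""
    if not chunks:
        return 0.0
    dup = sum(
        1 for i, chunk in enumerate(chunks)
        if any(jaccard(chunk, prev) >= threshold for prev in chunks[:i])
    )
    return dup / len(chunks)


# Hypothetical pipeline trace for a single query.
stages = {
    "retrieved": ["d1", "d2", "d3", "d4"],
    "reranked": ["d2", "d3"],
    "context": ["d3"],
}
recall = stage_recall(stages, relevant=["d1", "d2"])

# Hypothetical context chunks; the second is a near-duplicate of the first.
chunks = [
    "reset the bios password",
    "reset the bios password now",
    "update network driver",
]
dup_fraction = near_duplicate_fraction(chunks)
```

Here `recall` falls from 1.0 at retrieval to 0.5 after reranking and 0.0 in the final context, which is the pattern the funnel analysis looks for. A production system would more plausibly compare embeddings than token sets when detecting near-duplicates.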
Evaluation Highlights
  • Fusion reduced Hit@10 accuracy from 0.51 (baseline) to 0.48 in several configurations, despite higher initial recall
  • Fusion variants showed no statistically significant improvement in Top-3 accuracy (p_adj ≥ 0.125) compared to single-query baselines
  • Fusion added 0.89 s of latency overhead per query from query rewriting and fusion logic, degrading tail latency without any accuracy gain
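For readers unfamiliar with the metric, Hit@k is commonly computed as the share of queries whose top-k results contain at least one relevant document. The sketch below assumes that definition; the summary does not state the paper's exact formulation, and the query data is hypothetical.

```python
def hit_at_k(ranked_lists, relevant_sets, k):
    """Share of queries with at least one relevant doc in the top-k results."""
    if not ranked_lists:
        return 0.0
    hits = sum(
        1 for ranked, relevant in zip(ranked_lists, relevant_sets)
        if set(ranked[:k]) & set(relevant)
    )
    return hits / len(ranked_lists)


# Hypothetical results for four queries (one ranked list per query).
ranked_lists = [
    ["d1", "d2", "d3"],
    ["d4", "d5", "d6"],
    ["d7", "d8", "d9"],
    ["d2", "d1", "d0"],
]
relevant_sets = [{"d3"}, {"d9"}, {"d7"}, {"d1"}]

hit_at_1 = hit_at_k(ranked_lists, relevant_sets, k=1)
hit_at_3 = hit_at_k(ranked_lists, relevant_sets, k=3)
```

Computing the same metric on the baseline and fused outputs for an identical query set is what allows a comparison like 0.51 versus 0.48 to be read as a per-query difference.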
Breakthrough Assessment
4/10
A valuable negative-result paper for practitioners. It debunks the assumption that "more recall is always better" in RAG, but it does not propose a new method or algorithm.