Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment

📝 Paper Summary

Modularized RAG pipeline
Industrial/Enterprise RAG

Retrieval fusion increases upstream recall but fails to improve end-to-end answer accuracy in production RAG systems due to downstream reranking saturation and context window truncation.

Core Problem

Retrieval fusion techniques (like multi-query + RRF) are often adopted to boost recall, but their effectiveness is rarely evaluated under strict production constraints like fixed reranking budgets and latency limits.

Why it matters:

Enterprise RAG systems operate under tight latency and cost constraints, making efficiency critical
Higher recall at the retrieval stage is meaningless if the relevant documents are discarded during reranking or truncation before reaching the LLM
Engineers need to know if the complexity and latency overhead of fusion is justified by actual downstream gains

Concrete Example: A user asks a short, ambiguous support question. A fusion system generates a reformulation that retrieves 15 new documents. However, because the reranker only accepts the top 10 candidates and the reformulation introduces redundant or slightly off-topic chunks, the original correct document is pushed out of the final Top-10 context.

Key Novelty

Production-Constrained Evaluation of RAG Fusion

Investigates the 'funnel effect' in RAG pipelines: tracking whether recall gains from fusion actually survive the bottlenecks of reranking and context truncation
Demonstrates that fusion often introduces redundancy (near-duplicate chunks) rather than diverse information, which neutralizes benefits when the context window is fixed

Evaluation Highlights

Fusion reduced Hit@10 accuracy from 0.51 (baseline) to 0.48 in several configurations, despite higher initial recall
Fusion variants showed no statistically significant improvement in Top-3 accuracy (p_adj ≥ 0.125) compared to single-query baselines
Added 0.89s of latency overhead per query due to rewriting and fusion logic, degrading tail latency without accuracy gains

Breakthrough Assessment

4/10

A valuable negative-result paper for practitioners. It debunks the assumption that 'more recall is always better' in RAG, but it does not propose a new method or algorithm.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation over an enterprise knowledge base with fixed retrieval depth K and reranking budget

Inputs: User query Q and a fixed corpus of knowledge base documents

Outputs: A ranked list of Top-K text chunks used as context for generation

Pipeline Flow

Query Rewriting (LLaMA-based) -> Parallel Hybrid Retrieval (BM25 + Dense) -> Cross-Encoder Reranking -> Fusion (RRF) -> Deduplication -> Truncation
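
The last two stages of the flow are where recall gains can be lost. The sketch below is a minimal illustration, not the paper's implementation: the function name `dedup_and_truncate`, the exact-match deduplication on normalized text, and the sample chunks are all assumptions made for the example.

```python
def dedup_and_truncate(ranked_chunks, k):
    """Drop exact duplicates (after whitespace/case normalization), keep the top k.

    ranked_chunks: list of (chunk_id, text) tuples, best first.
    Returns the chunk ids that would reach the context window.
    """
    seen = set()
    kept = []
    for chunk_id, text in ranked_chunks:
        if len(kept) >= k:
            break  # context budget is full; everything below is truncated
        norm = " ".join(text.lower().split())
        if norm in seen:
            continue  # exact duplicate of a chunk already kept
        seen.add(norm)
        kept.append(chunk_id)
    return kept


# Hypothetical fused list: c2 duplicates c1 exactly, c3 is only a paraphrase.
fused = [
    ("c1", "Reset your password from the login page."),
    ("c2", "reset your password  from the login page."),
    ("c3", "To reset a password, use the login page."),
    ("c4", "Contact support if the reset email never arrives."),
]
context_ids = dedup_and_truncate(fused, k=2)
print(context_ids)  # ['c1', 'c3']
```

In this toy case the exact duplicate `c2` is removed, but the paraphrase `c3` survives and takes the second slot, so the only chunk with new information (`c4`) is truncated. That is the crowding-out pattern the summary describes.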

System Modules

Query Rewriter

Generate a paraphrase or reformulation of the original user query

Model or implementation: LLaMA-based language model

Hybrid Retriever (Retrieval & Selection)

Retrieve candidate chunks using both lexical and semantic search

Model or implementation: BM25 (sparse) + Granite embedding model (dense)

Reranker (Retrieval & Selection)

Re-score retrieved candidates for higher precision

Model or implementation: FlashRank (Cross-Encoder)

Fusion Layer (Retrieval & Selection)

Combine the ranked lists from the original query (Q1) and its reformulation (Q2)

Model or implementation: Reciprocal Rank Fusion (RRF)
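
For reference, RRF scores each document by summing 1 / (k + rank) over every list in which it appears. The sketch below is a generic version of that formula, not the paper's code; the smoothing constant `k=60` is the commonly used default and is an assumption here, as are the function name and the sample document ids.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    k is the RRF smoothing constant (assumed value; not stated in the summary).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical reranked results for the original query and its rewrite.
q1_results = ["d1", "d2", "d3"]
q2_results = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([q1_results, q2_results])
print(fused)  # ['d2', 'd1', 'd4', 'd3']
```

Documents that appear in both lists (`d2`, `d1`) rise to the top, which is why fusion rewards agreement between queries more than it rewards novelty.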

Novel Architectural Elements

Placement of fusion logic *after* per-query reranking to simulate production constraints where reranking capacity is the bottleneck

Modeling

Base Model: Granite (embeddings), FlashRank (reranker), LLaMA (query rewriting)

Compute: ~0.89s median latency overhead per query from rewriting and fusion logic

Comparison to Prior Work

vs. Standard RAG Fusion: This paper evaluates it under fixed compute/context budgets rather than unconstrained recall settings
vs. HyDE [not cited in paper]: HyDE generates hypothetical documents for retrieval; this paper generates query paraphrases
vs. Multi-Vector Retrieval [not cited in paper]: This paper fuses at the list level (RRF) rather than the vector level

Limitations

Evaluation performed on a synthetic dataset of 115 enterprise support queries
Tested only one fusion algorithm (RRF) and one reranker (FlashRank)
Results specific to the enterprise support domain; may differ for open-domain QA
Fixed retrieval depth of K=10 might be too restrictive for some use cases

Reproducibility

Not provided. The paper uses internal enterprise data and does not link to a public repository or provide model weights.

📊 Experiments & Results

Evaluation Setup

Enterprise support QA over a proprietary knowledge base

Benchmarks:

Synthetic Enterprise Queries (KB-grounded Question Answering) [New]

Metrics:

Hit@K (Top-1, Top-3, Top-10)
Jaccard Similarity (between query results)
Latency (seconds)
Statistical methodology: McNemar’s exact test with Benjamini–Hochberg correction (FDR=0.05)
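
As a rough illustration of the statistical methodology, the sketch below computes a two-sided exact McNemar p-value from discordant pair counts and applies the Benjamini-Hochberg adjustment to a set of p-values. It uses only the standard library; the function names and all input numbers are hypothetical and are not taken from the paper.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: queries the baseline got right and fusion got wrong.
    c: queries fusion got right and the baseline got wrong.
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical discordant counts for one baseline-vs-fusion comparison.
p = mcnemar_exact(b=5, c=1)
print(p)  # 0.21875

# Hypothetical raw p-values for three comparisons.
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

A comparison would count as significant at FDR=0.05 only if its adjusted p-value falls below 0.05.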

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
End-to-end accuracy metrics showing fusion failing to outperform single-query baselines after reranking.
Synthetic Enterprise Queries	Hit@10	0.51	0.48	-0.03
Synthetic Enterprise Queries	Top-3 Accuracy	Not reported in the paper	Not reported in the paper	Not reported in the paper
Diversity analysis showing that while fusion adds candidates, they are often redundant.
Synthetic Enterprise Queries	Jaccard Similarity	0	0.09	+0.09
Operational costs of adding fusion to the pipeline.
Synthetic Enterprise Queries	Latency Overhead (s)	0	0.89	+0.89

Main Takeaways

Recall gains from fusion are neutralized by downstream reranking and truncation constraints
Fusion introduces 'semantic redundancy'—new documents are often near-duplicates that crowd out diverse information in the Top-K context
Reranking based on the reformulated query (Q2) can actively harm performance by over-aligning with drift in the rewrite
Fusion is only beneficial for a small slice of 'recall-scarce' queries, but these gains are not enough to justify the system-wide latency cost

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG pipelines (Retrieval, Reranking, Generation)
Familiarity with sparse (BM25) vs dense retrieval
Knowledge of rank fusion techniques like RRF

Key Terms

RAG: Retrieval-Augmented Generation, in which a system answers questions by first retrieving relevant documents and then conditioning the language model's answer on them

RRF: Reciprocal Rank Fusion—a method for combining multiple ranked lists of search results into a single unified list

BM25: A ranking function used by search engines to estimate the relevance of documents to a given search query based on keyword matching

Hit@K: A metric indicating whether at least one relevant document appears in the top K retrieved results
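
The definition above maps directly to a few lines of code. This is a minimal sketch with hypothetical function names and toy data, assuming each query comes with a ranked list of document ids and a set of relevant ids.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1 if any relevant document is in the top k results, else 0."""
    return int(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))


def mean_hit_at_k(runs, k):
    """Average Hit@K over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(hit_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Two toy queries: the first finds its relevant doc at rank 3, the second never does.
runs = [
    (["a", "b", "c"], {"c"}),
    (["x", "y", "z"], {"q"}),
]
print(mean_hit_at_k(runs, k=3))  # 0.5
print(mean_hit_at_k(runs, k=1))  # 0.0
```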

reranking: A second stage of retrieval where a more powerful model (cross-encoder) re-scores a small set of candidate documents

cross-encoder: A transformer model that processes the query and document simultaneously to output a relevance score, more accurate but slower than bi-encoders

latency: The time delay between a user's request and the system's response

Jaccard similarity: The size of the intersection of two sets divided by the size of their union; used here to measure overlap between the result sets of two queries
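
A minimal sketch of the computation on two result lists follows. The function name and document ids are illustrative, and returning 0.0 for two empty lists is a convention chosen for this example.

```python
def jaccard(results_a, results_b):
    """Jaccard similarity between two collections of document ids."""
    a, b = set(results_a), set(results_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty result sets
    return len(a & b) / len(union)


# One shared document out of five distinct documents overall.
print(jaccard(["d1", "d2", "d3"], ["d3", "d4", "d5"]))  # 0.2
```

Note that this measures overlap of document ids only, so two near-duplicate chunks with different ids count as distinct.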

truncation: Cutting off the list of retrieved documents to fit within the language model's maximum context window

KB: Knowledge Base—the collection of documents the system searches through

recall: The fraction of relevant documents that are successfully retrieved