GAR-meets-RAG Paradigm for Zero-Shot Information Retrieval

📝 Paper Summary

Modularized RAG pipeline, Agentic RAG pipeline

RRR (Rewrite-Retrieve-Rerank) is an iterative zero-shot retrieval framework that cycles between RAG-based query rewriting and GAR-based retrieval to maximize recall, followed by LLM-based re-ranking to maximize precision.

Core Problem

Existing zero-shot retrieval paradigms like GAR and RAG struggle because high-recall retrieval is difficult without domain data, and high-precision re-ranking requires a good initial document set.

Why it matters:

Zero-shot settings lack training data, making it hard to calibrate retrieval scores or fine-tune dense retrievers
GAR (query expansion) depends heavily on the quality of generated context, while RAG (answer generation) depends on the quality of retrieved documents
Current methods treat rewrite, retrieve, and re-rank as separate stages without a feedback loop to refine the query intent

Concrete Example: For the query 'How diet soda could make us gain weight?', a standard retriever fetches irrelevant documents about 'whole grain intake' or 'caloric restriction'. RRR uses these initial results to generate a better query rewrite ('Effects of artificial sweeteners on weight gain'), which then retrieves the correct document about 'antioxidant-rich spices'.

Key Novelty

GAR-meets-RAG Recurrence (RRR)

Formulate retrieval as a recurring loop where a RAG model generates a query rewrite, which feeds into a GAR model for retrieval, which in turn feeds back into the RAG model
Use a relevance-based filtering step within the loop to remove spurious documents (false positives) before they corrupt the next query rewrite
Decouple recall and precision objectives: the iterative loop maximizes recall, while a final LLM-based re-ranker maximizes precision

Architecture

The iterative Rewrite-Retrieve-Rerank (RRR) workflow.

Evaluation Highlights

Achieves new state-of-the-art on 6 out of 8 BEIR datasets for Recall@100 and nDCG@10 metrics in zero-shot settings
Outperforms RankGPT by as much as 17% (relative) in nDCG@10 on individual datasets within the BEIR benchmark
Improves Recall@100 on TREC-DL 20 by +3.5 points (58.6 vs 55.1) when increasing rewrite iterations from 1 to 5

Breakthrough Assessment

8/10

Significant performance gains on standard benchmarks without training. The iterative feedback loop between GAR and RAG is a clever, high-impact architectural pattern for zero-shot IR.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot document retrieval: given query q and corpus Z, output ranked list S of top-N relevant documents without accessing relevance labels

Inputs: Query q, corpus Z

Outputs: Ranked list of documents S

Pipeline Flow

Iterative Loop: Query Rewrite (RAG) → Retrieval (GAR) → Relevance Filtering → Feedback
Final Stage: Re-ranking

System Modules

Retriever (f) (Retrieval & Selection)

Fetch initial candidate documents based on the current query rewrite

Model or implementation: BM25 (Pyserini implementation)

Relevance Filter (σ) (Retrieval & Selection)

Score and filter retrieved documents to remove false positives before the next rewrite

Model or implementation: GPT-3.5-Turbo

Rewriter (g)

Generate a new, better query based on the original query and the filtered retrieved documents

Model or implementation: GPT-4 (token limit 20 in prompt)

Re-ranker (h)

Re-order the final accumulated set of documents to maximize precision

Model or implementation: GPT-3.5-Turbo (initial pass) + GPT-4 (top 30 pass)

Novel Architectural Elements

Recurrent feedback loop where RAG output (rewrite) drives GAR input (retrieval) and vice-versa
Interleaved relevance filtering step within the retrieval loop to maintain context purity
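The loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the function name `rrr`, the callable signatures, the `>= tau` comparison, scoring relevance against the original query, and re-ranking the whole accumulated pool are all assumptions made for the sketch. The four callables stand in for the retriever (f), relevance filter (σ), rewriter (g), and re-ranker (h).

```python
from typing import Callable, Dict, List


def rrr(
    query: str,
    retrieve: Callable[[str], List[str]],           # f: query text -> doc ids
    relevance: Callable[[str, str], float],         # sigma: (query, doc id) -> score
    rewrite: Callable[[str, List[str]], str],       # g: (query, kept doc ids) -> new query
    rerank: Callable[[str, List[str]], List[str]],  # h: (query, doc ids) -> ordered doc ids
    max_rewrites: int = 5,
    tau: float = 1.0,
    top_n: int = 100,
) -> List[str]:
    """Hypothetical Rewrite-Retrieve-Rerank loop (illustrative only)."""
    pool: Dict[str, float] = {}  # accumulated candidates with relevance scores
    current = query
    for step in range(max_rewrites + 1):
        # Retrieval: add unseen documents for the current query text.
        for doc_id in retrieve(current):
            if doc_id not in pool:
                # Assumption: relevance is judged against the original query.
                pool[doc_id] = relevance(query, doc_id)
        if step == max_rewrites:
            break
        # Relevance filtering: only documents scoring at least tau
        # are shown to the rewriter (assumed comparison).
        kept = [d for d, score in pool.items() if score >= tau]
        # Rewrite: produce the next query from the original query and kept docs.
        current = rewrite(query, kept)
    # Final stage: re-rank everything accumulated and return the top N.
    return rerank(query, list(pool))[:top_n]
```

Passing the modules as callables keeps the sketch independent of any particular retriever or LLM client; in practice `retrieve` would wrap a BM25 index and the other three would wrap prompted LLM calls.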

Modeling

Base Model: GPT-4 and GPT-3.5-Turbo

Training Method: Zero-shot inference with iterative prompting

Key Hyperparameters:

max_rewrites_Nrw: 5
relevance_threshold_tau: 1
sliding_window_size_w: 10
sliding_window_step_s: 5
gpt_3.5_token_limit: 4097
gpt_4_token_limit: 8192
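The sliding-window hyperparameters (w = 10, s = 5) suggest a windowed re-ranking pass of the kind used by listwise LLM re-rankers, where only a window of documents fits into one prompt. The sketch below shows one common back-to-front variant; the function name, the `rerank_window` callable, and the traversal direction are assumptions, and the paper's exact procedure may differ.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def sliding_window_rerank(
    query: str,
    docs: List[T],
    rerank_window: Callable[[str, List[T]], List[T]],  # hypothetical: reorders one window
    window: int = 10,
    step: int = 5,
) -> List[T]:
    """Re-rank a long list by reordering overlapping windows, back to front."""
    if not 0 < step <= window:
        raise ValueError("step must be in (0, window]")
    ranked = list(docs)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window)
        # Reorder only the documents inside the current window.
        ranked[start:end] = rerank_window(query, ranked[start:end])
        if start == 0:
            break
        # Slide toward the front; the overlap of (window - step) documents
        # lets strong candidates keep moving up in later windows.
        end -= step
    return ranked
```

With `window=10` and `step=5`, a single pass over 20 documents calls the window re-ranker three times and can promote up to five documents from the very end of the list to the very front.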

Compute: Inference only. Approximately 1 USD per input query using OpenAI APIs.

Comparison to Prior Work

vs. RankGPT: RRR adds an iterative rewrite-retrieve loop to improve recall *before* re-ranking, whereas RankGPT only re-ranks a fixed initial set
vs. Promptagator: RRR is purely zero-shot inference without training a specific retriever; Promptagator requires training a dense retriever on synthetic data
vs. GAR (standard) [not cited in paper]: Standard GAR generates context once; RRR iteratively refines context via retrieval feedback
vs. Self-RAG [not cited in paper]: Self-RAG trains specific tokens for critique; RRR uses off-the-shelf LLMs with prompting for relevance filtering

Limitations

High inference cost (approximately 1 USD per query) due to multiple LLM calls per iteration
High latency due to iterative synchronous calls to LLM APIs
Dependence on closed-source models (GPT-4) for optimal performance
Performance on some datasets (TREC-COVID, Signal-1M) is mixed compared to simple BM25

Reproducibility

No code URL provided in paper. Uses closed-source OpenAI models (GPT-3.5-Turbo, GPT-4). Prompt templates are provided in Appendix A. BM25 implementation uses Pyserini.

📊 Experiments & Results

Evaluation Setup

Zero-shot retrieval on standard benchmarks

Benchmarks:

BEIR (diverse IR tasks; 8 datasets selected, including TREC-COVID and NFCorpus)
TREC-DL 20 (Passage Retrieval)

Metrics:

nDCG@10
Recall@100
Statistical methodology: Not explicitly reported in the paper
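For readers less familiar with the two metrics, here is a small reference sketch. It assumes linear gain for nDCG (some toolkits use an exponential gain instead) and a ranked list without duplicates; the function names are illustrative and are not taken from the paper or from any evaluation library.

```python
import math
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: rank r (1-based) is discounted by log2(r + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked: Sequence[str], grades: Dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains: List[float] = [grades.get(doc_id, 0.0) for doc_id in ranked]
    ideal = sorted(grades.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

Recall@100 ignores ordering within the top 100, which is why it suits the recall-oriented loop, while nDCG@10 rewards placing relevant documents near the top, which is what the re-ranker targets.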

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
RRR outperforms baselines on the majority of BEIR datasets in terms of nDCG@10.
BEIR (TREC-NEWS)	nDCG@10	52.9	53.6	+0.7
BEIR (Robust04)	nDCG@10	57.6	67.4	+9.8
BEIR (SciFact)	nDCG@10	75.0	77.2	+2.2
On TREC-DL 20, RRR consistently outperforms baselines across nDCG metrics at different cutoffs.
TREC-DL 20	nDCG@10	70.6	72.3	+1.7
Ablation studies demonstrate the value of the iterative feedback loop and re-ranking components.
TREC-DL 20	Recall@100	55.1	58.6	+3.5
TREC-DL 20	nDCG@10	42.0	72.3	+30.3
TREC-DL 20	Recall@100	56.8	58.6	+1.8
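The Δ column reports absolute point differences. As a worked check on how these relate to relative gains, the Robust04 row gives:

```latex
\[
\text{relative gain} = \frac{\text{this paper} - \text{baseline}}{\text{baseline}},
\qquad
\frac{67.4 - 57.6}{57.6} = \frac{9.8}{57.6} \approx 0.170 \;(17.0\%).
\]
```

This matches the roughly 17% relative figure quoted in the highlights, although the table does not name the baseline system for each row, so the link is an inference. By the same formula, the other nDCG@10 rows correspond to about 1.3% (TREC-NEWS), 2.9% (SciFact), and 2.4% (TREC-DL 20).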

Main Takeaways

The iterative rewrite-retrieve loop effectively increases recall by refining query intent based on initial document findings.
Re-ranking is critical for precision (nDCG); the recall-oriented loop alone provides a good candidate set but poor ordering.
Feedback from retrieved documents to the rewriter is essential; cutting this link drops performance.
BM25 remains a strong baseline and a core component of the pipeline: when augmented with RRR, it outperforms dense retrievers such as DPR and ANCE in zero-shot settings.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Information Retrieval (IR) metrics (nDCG, Recall)
Familiarity with sparse vs. dense retrieval
Basic knowledge of Large Language Models (LLMs) and prompting

Key Terms

GAR: Generation-Augmented Retrieval—a paradigm where an LLM generates additional context (like query expansions) to help a retriever find documents

RAG: Retrieval-Augmented Generation—a paradigm where a retriever fetches documents to help an LLM generate an answer

Zero-shot IR: Information Retrieval tasks performed without any training data from the target domain

nDCG@k: Normalized Discounted Cumulative Gain at k—a measure of ranking quality that accounts for the position of relevant documents

Recall@k: The fraction of relevant documents retrieved within the top-k results

BM25: Best Matching 25—a standard probabilistic information retrieval function based on term frequency and inverse document frequency

Relevance Model: A component (often an LLM) that scores how relevant a document is to a query, used here for filtering

Promptagator: A baseline method that uses LLMs to generate synthetic queries from a corpus to train a dense retriever
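To make the BM25 entry above concrete, the following is a toy scorer using the common Okapi form with a Lucene-style IDF. It is only a sketch: whitespace tokenization and the defaults `k1=1.2`, `b=0.75` are assumptions for illustration, and a production setup such as the Pyserini implementation mentioned in this summary uses a real analyzer, an inverted index, and its own parameter defaults.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with a toy BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because the score depends only on term overlap and corpus statistics, it needs no training data, which is what makes BM25 a natural first-stage retriever in zero-shot settings.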