MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

📝 Paper Summary

Modularized RAG pipeline

MultiHop-RAG is a new benchmarking dataset specifically designed to evaluate RAG systems on complex multi-hop queries that require retrieving and reasoning over multiple distinct evidence documents.

Core Problem

Existing RAG benchmarks (like RGB and RECALL) primarily evaluate simple single-hop queries where answers reside in a single document, failing to assess capabilities on complex queries requiring multi-document reasoning.

Why it matters:

Real-world queries often require connecting diverse information sources (e.g., comparing financial reports from two different years)
Current benchmarks do not expose failures in multi-hop retrieval or reasoning, potentially overestimating system performance
Standard similarity matching (cosine similarity) struggles when a query's answer depends on synthesizing disparate pieces of evidence rather than matching a single text chunk

Concrete Example: A financial analyst asks: 'Which company among Google, Apple, and Nvidia reported the largest profit margins in 2023?' A standard RAG system might retrieve a single document about one company's margin but fail to retrieve and compare all three necessary reports to answer correctly.

Key Novelty

Multi-Hop RAG Benchmark Construction via Generator-Validator Pipeline

Constructs a dataset specifically for multi-hop queries using real-world news articles as a knowledge base, unlike previous benchmarks focused on single-document retrieval
Uses a semi-automated pipeline where GPT-4 paraphrases factual evidence into 'claims', extracts 'bridge-entities' (shared topics), and generates questions that require linking these separate claims
Introduces four specific query types (Inference, Comparison, Temporal, Null) to test different reasoning capabilities beyond simple fact retrieval

Architecture

The data construction pipeline for MultiHop-RAG

Evaluation Highlights

Standard embedding models struggle significantly: the best model (Voyage-02) achieves only 0.7467 Hits@10 even with re-ranking
LLM reasoning is a major bottleneck: Llama-2-70b achieves only 28% accuracy on multi-hop queries using retrieved chunks
GPT-4 dominates reasoning tasks with 89% accuracy when given ground-truth evidence, while open-source models like Mixtral-8x7B lag behind at 36%

Breakthrough Assessment

7/10

Provides a needed and previously missing resource (a multi-hop RAG benchmark) that reveals significant gaps in current RAG systems. While the LLM-based construction method is standard, the focus on multi-hop retrieval is high-impact.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation (RAG) for Multi-Hop Queries

Inputs: A multi-hop user query q and a corpus of documents D

Outputs: A generated answer based on a retrieved set of chunks R_q = {r_1, ..., r_K}
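
The retrieval step in this setting can be illustrated with a minimal sketch. It assumes a toy bag-of-words vector in place of a real embedding model, and the chunk texts, company names, and function names (`embed`, `cosine`, `retrieve`) are hypothetical, not taken from the paper. It only shows how a top-K set is formed by cosine similarity; for a comparison query, every required chunk has to land inside that set.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """Return the K chunks most similar to the query, best first."""
    q_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    return ranked[:k]


# Made-up chunks for illustration only.
chunks = [
    "alpha corp reported a profit margin of 10 percent in 2023",
    "the city council approved a new park",
    "beta inc reported a profit margin of 20 percent in 2023",
]
top2 = retrieve("which company reported the largest profit margin in 2023", chunks, k=2)
```

With K = 2 both margin chunks are returned here; with K = 1 the comparison could not be answered from the retrieved set, which is the failure mode multi-hop queries are meant to expose.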

Pipeline Flow

Dataset Collection (mediastack news API)
Evidence Extraction (Model-based)
Claim & Bridge Generation (GPT-4)
Query Generation (GPT-4)
Quality Assurance (Human + GPT-4)

System Modules

Evidence Extractor (Data Construction)

Extract factual sentences from news articles to serve as potential evidence

Model or implementation: fact-or-opinion-xlmr-el

Claim Generator (Data Construction)

Paraphrase evidence into clear claims and identify bridge-entities/topics

Model or implementation: GPT-4

Query Generator (Data Construction)

Generate multi-hop queries by linking claims via bridge-entities

Model or implementation: GPT-4

Novel Architectural Elements

Bridge-entity based query construction: Explicitly identifying shared entities/topics across documents to synthetically generate multi-hop dependencies
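
The bridge-entity idea can be sketched as a grouping step. This is a minimal illustration, not the authors' implementation: the `Claim` fields, the function names, and the rule that each claim in a set must come from a different document are assumptions, and the claim texts are placeholders. The cap of four claims per set mirrors the four-evidence limit noted under Limitations. In the paper, GPT-4 then writes the actual question from such a claim set; that step is omitted here.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Claim:
    doc_id: str  # article the claim was derived from
    bridge: str  # shared entity or topic assigned to the claim
    text: str    # paraphrased factual statement


def group_by_bridge(claims: list[Claim]) -> dict[str, list[Claim]]:
    """Index claims by their case-insensitive bridge entity."""
    groups: dict[str, list[Claim]] = defaultdict(list)
    for claim in claims:
        groups[claim.bridge.lower()].append(claim)
    return groups


def multi_hop_candidates(claims: list[Claim], min_hops: int = 2, max_hops: int = 4):
    """Return (bridge, claim set) pairs whose claims come from distinct documents."""
    candidates = []
    for bridge, group in group_by_bridge(claims).items():
        for size in range(min_hops, min(max_hops, len(group)) + 1):
            for combo in combinations(group, size):
                # Assumed rule: one claim per document, so the query spans documents.
                if len({c.doc_id for c in combo}) == size:
                    candidates.append((bridge, combo))
    return candidates


# Placeholder claims for illustration only.
claims = [
    Claim("doc_a", "Federal Reserve", "placeholder claim 1"),
    Claim("doc_b", "Federal Reserve", "placeholder claim 2"),
    Claim("doc_b", "Federal Reserve", "placeholder claim 3"),
    Claim("doc_c", "Acme Corp", "placeholder claim 4"),
]
candidates = multi_hop_candidates(claims)
```

Only the two cross-document pairs that share the 'Federal Reserve' bridge survive; the single 'Acme Corp' claim has nothing to link to and yields no candidate.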

Modeling

Base Model: GPT-4 (for data generation)

Compute: Not reported in the paper

Comparison to Prior Work

vs. RGB/RECALL: MultiHop-RAG explicitly targets multi-hop queries requiring multiple documents, whereas RGB/RECALL focus on single-evidence queries
vs. HotpotQA: MultiHop-RAG uses recent news articles to reduce the risk that answers were memorized during LLM pre-training, whereas HotpotQA is built on Wikipedia (the paper cites HotpotQA as a QA dataset, not as a RAG benchmark)

Limitations

Ground truth answers are restricted to simple responses (yes/no, entity names) to facilitate accuracy metrics, excluding complex free-text answers
Supporting evidence is limited to a maximum of four pieces per query
Experiments utilize a basic RAG framework (LlamaIndex) without advanced agentic or query decomposition techniques

Reproducibility

Code: https://github.com/yixuantt/MultiHop-RAG/

The dataset and evaluation code are publicly available on GitHub. The paper details the data construction prompts in Appendix A. The news data source is publicly available via the mediastack API.

📊 Experiments & Results

Evaluation Setup

RAG benchmarking using a knowledge base of 609 news articles (Sept-Dec 2023)

Benchmarks:

MultiHop-RAG (Multi-hop RAG Retrieval and Answering) [New]

Metrics:

MAP@K (Mean Average Precision)
MRR@K (Mean Reciprocal Rank)
Hits@K (Hit Rate)
Generation Accuracy
Statistical methodology: Not explicitly reported in the paper
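
Because gold answers are short (yes/no or entity names), Generation Accuracy can be approximated by checking whether the normalized gold answer appears in the model output. The sketch below is one plausible scoring rule, not the paper's exact procedure; the function names and the 'Insufficient information' label used for the null query are assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def is_correct(prediction: str, gold: str) -> bool:
    """True if the normalized gold answer occurs in the prediction as whole words."""
    pred, ans = normalize(prediction), normalize(gold)
    return bool(ans) and f" {ans} " in f" {pred} "


def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions that contain their gold answer."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    return sum(is_correct(p, g) for p, g in zip(predictions, golds)) / len(golds)


# Made-up model outputs and gold answers for illustration only.
predictions = ["Yes.", "The answer is Nvidia", "No", "insufficient information"]
golds = ["yes", "Nvidia", "yes", "Insufficient information"]
score = accuracy(predictions, golds)  # 3 of 4 match
```

Whole-word matching avoids counting 'no' as correct when the output merely contains a word such as 'known'.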

Key Results

Retrieval performance, showing the difficulty of multi-hop retrieval even for top embedding models:
Benchmark	Metric	Baseline	This Paper	Δ
MultiHop-RAG	Hits@10	0.7059	0.7467	+0.0408
MultiHop-RAG	MRR@10	0.5477	0.5860	+0.0383

Generation/reasoning performance, comparing LLMs given retrieved context vs. ground-truth context:
Benchmark	Metric	Baseline	This Paper	Δ
MultiHop-RAG	Accuracy	0.28	0.56	+0.28
MultiHop-RAG	Accuracy	0.36	0.89	+0.53

Experiment Figures

Generation accuracy broken down by query type (Inference, Comparison, Temporal, Null) for GPT-4 vs. Mixtral-8x7B

Main Takeaways

A significant retrieval gap remains: even the best embedding model with a reranker retrieves only about 75% of the necessary evidence within the top 10 results (Hits@10 = 0.7467)
Open-source models (Llama-2, Mixtral) struggle heavily with reasoning over multiple documents, even when perfect evidence is provided (max 36% accuracy)
GPT-4 shows robust reasoning (89% accuracy) given ground truth, suggesting the bottleneck for SOTA models is retrieval, while for open models it is both retrieval and reasoning
Models perform relatively well on Null queries (detecting unanswerable questions) but fail significantly on Comparison and Temporal queries

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Vector embeddings and similarity search
Large Language Models (LLMs) for reasoning

Key Terms

multi-hop query: A question that requires retrieving and reasoning over multiple pieces of supporting evidence (often from different documents) to provide an answer

bridge-entity: A shared entity or topic (e.g., 'Federal Reserve') that links different pieces of evidence, enabling the construction of a multi-hop query

null query: A query designed to have no answer within the knowledge base, used to test a model's ability to avoid hallucination

Hits@K: A metric measuring the fraction of ground-truth evidence pieces that appear among the top-K retrieved chunks

Reranker: A second-stage retrieval model that re-scores the initial set of retrieved documents to improve precision

MRR@K: Mean Reciprocal Rank at K. The average over queries of 1/rank of the first relevant chunk within the top-K results, counted as 0 when no relevant chunk appears

MAP@K: Mean Average Precision at K. The average over queries of the precision values measured at each rank, within the top-K results, where a relevant chunk appears
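
The three retrieval metrics above can be computed per query and then averaged. The sketch below follows the definitions in this list rather than the paper's released evaluation code. It assumes retrieved chunk ids are unique, and it normalizes average precision by min(number of relevant chunks, K), which is one common convention among several. All function names and ids are illustrative.

```python
def hits_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunks found among the top-K retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1/rank of the first relevant chunk in the top K, or 0 if none is found."""
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Sum of precision at each relevant rank, divided by min(|relevant|, K)."""
    found, total = 0, 0.0
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in relevant:
            found += 1
            total += found / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def mean_over_queries(metric, runs, k: int) -> float:
    """Average a per-query metric over (retrieved, relevant) pairs."""
    return sum(metric(r, g, k) for r, g in runs) / len(runs) if runs else 0.0


# Two made-up queries: ranked chunk ids and their ground-truth evidence ids.
runs = [
    (["c3", "c1", "c7", "c2"], {"c1", "c2"}),
    (["c9", "c8", "c4", "c5"], {"c5", "c6"}),
]
hits = mean_over_queries(hits_at_k, runs, k=4)              # (1.0 + 0.5) / 2
mrr = mean_over_queries(reciprocal_rank_at_k, runs, k=4)    # (0.5 + 0.25) / 2
map_k = mean_over_queries(average_precision_at_k, runs, k=4)  # (0.5 + 0.125) / 2
```

In the second query only one of the two evidence chunks is retrieved, and it sits at rank 4, so all three metrics drop. For a multi-hop query, the missing chunk also means the answer cannot be fully supported.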