MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems

📝 Paper Summary

Benchmark datasets, Evaluation frameworks

mtRAG is a human-generated multi-turn RAG benchmark, built with active retrieval during annotation, that exposes LLM failures on context-dependent retrieval, unanswerable questions, and domain shifts.

Core Problem

Existing RAG benchmarks focus primarily on single-turn Q&A or use fixed retrieval contexts, failing to capture the complexity of real-world multi-turn conversations where retrieval needs evolve dynamically.

Why it matters:

Current systems excel at single-turn RAG but struggle when questions rely on previous turns or require new retrieval, leading to poor user experiences
Existing datasets often ignore unanswerable questions, which are a major source of hallucinations in production RAG systems
Benchmarks with static retrieval do not test the system's ability to handle active retrieval where the relevant passages change throughout the conversation

Concrete Example: In a multi-turn chat, a user might ask 'Who is the CEO of Apple?' and then follow up with 'its address?'. Without query rewriting, a standard retriever may fail on 'its address?' because the follow-up never names Apple, and an LLM may hallucinate an address if the retrieved document only discusses the CEO. mtRAG captures these dependencies and failures.

Key Novelty

Human-Annotated Active Retrieval Benchmark (mtRAG)

Constructed by humans interacting with a live RAG system, allowing annotators to refine questions, retrieval results, and answers in real-time to ensure quality
Explicitly incorporates 'active retrieval' where relevant passages change between turns, unlike benchmarks that use a single fixed context for an entire conversation
Includes a diverse set of failure-inducing scenarios: unanswerable questions, non-standalone follow-ups, and four distinct domains (Finance, Government, Tech, Wikipedia)

Architecture

Figure: A snippet of a conversation from the mtRAG benchmark illustrating the data creation process.

Evaluation Highlights

All 9 tested LLMs (including GPT-4o and Llama 3.1 405B) perform significantly worse on unanswerable questions compared to answerable ones
Retrieval performance drops significantly for later conversation turns compared to the first turn (e.g., Elser Recall@5 drops from 95.5 to 73.1 in the Cloud (Tech) domain)
Query rewriting consistently improves retrieval for non-standalone questions (improving Elser Recall@5 from 63.8 to 82.2 on non-standalone turns)

Breakthrough Assessment

8/10

Fills a critical gap in RAG evaluation by providing high-quality, human-verified multi-turn data with active retrieval dynamics. The inclusion of unanswerable questions and domain diversity makes it a robust stress test for modern systems.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn Information Seeking Conversation with Retrieval

Inputs: Conversation history (questions q_1...q_k, answers a_1...a_{k-1}) and a document corpus C

Outputs: Response a_k grounded in retrieved passages from C
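The input/output definition above can be made concrete with a small container type. This is a minimal sketch only: the `Task` class, its field names, and the placeholder answer text are hypothetical and are not taken from the benchmark's released data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Task:
    """One evaluation task: produce a_k given q_1..q_k and a_1..a_{k-1}."""

    questions: List[str]  # q_1 ... q_k (the last entry is the current question)
    answers: List[str] = field(default_factory=list)  # a_1 ... a_{k-1}

    def __post_init__(self) -> None:
        # The definition requires exactly one fewer answer than questions.
        if len(self.questions) != len(self.answers) + 1:
            raise ValueError("expected k questions and k-1 answers")

    @property
    def k(self) -> int:
        """Index of the turn that still needs a response."""
        return len(self.questions)

    def history(self) -> List[Tuple[str, str]]:
        """Completed (question, answer) pairs that precede turn k."""
        return list(zip(self.questions[:-1], self.answers))


# Hypothetical two-turn task; the answer string is a placeholder.
task = Task(
    questions=["Who is the CEO of Apple?", "its address?"],
    answers=["<answer to turn 1>"],
)
```

Keeping the current question separate from the completed pairs makes it explicit that the response a_k is the only thing the system has to produce.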

Pipeline Flow

User Query (Turn k)
Query Rewriter (converts to standalone)
Retriever (fetches passages)
Generator (produces response)
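The four stages above can be sketched as a single function that wires three pluggable components together. All names here (`answer_turn`, the `toy_*` stand-ins, the two-passage corpus) are illustrative assumptions; the toy components only show the data flow and say nothing about how the paper's systems are implemented.

```python
import re
from typing import Callable, List, Sequence, Tuple

History = Sequence[Tuple[str, str]]  # completed (question, answer) pairs


def answer_turn(
    history: History,
    question: str,
    rewrite: Callable[[History, str], str],
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[History, str, List[str]], str],
    top_k: int = 5,
) -> Tuple[str, List[str], str]:
    """Run one turn: rewrite, then retrieve, then generate."""
    standalone = rewrite(history, question)            # Query Rewriter
    passages = retrieve(standalone, top_k)             # Retriever
    response = generate(history, question, passages)   # Generator
    return standalone, passages, response


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


# Toy stand-ins (assumptions), used only to exercise the flow.
def toy_rewrite(history: History, question: str) -> str:
    # Naive rewriting: prepend the previous user question, if any.
    return f"{history[-1][0]} {question}" if history else question


def toy_retrieve(query: str, top_k: int) -> List[str]:
    corpus = ["Bananas are yellow", "Apple headquarters address and directions"]
    ranked = sorted(corpus, key=lambda p: len(_words(query) & _words(p)), reverse=True)
    return ranked[:top_k]


def toy_generate(history: History, question: str, passages: List[str]) -> str:
    # Echo the top passage, or decline when nothing was retrieved.
    return passages[0] if passages else "I don't know."


standalone, passages, response = answer_turn(
    [("Who is the CEO of Apple?", "<answer to turn 1>")],
    "its address?",
    toy_rewrite, toy_retrieve, toy_generate, top_k=1,
)
```

Passing the components as callables makes it straightforward to swap retrievers or generators, which is how a benchmark like this compares several of each.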

System Modules

Query Rewriter

Reformulate context-dependent user questions into standalone queries

Model or implementation: LLM-based rewriter (implemented via prompting)
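A prompt-based rewriter could assemble its input roughly as follows. The instruction wording, the `User:`/`Agent:` labels, and the function name `build_rewrite_prompt` are assumptions made for illustration; the paper's actual prompt is not reproduced here.

```python
from typing import List, Sequence, Tuple


def build_rewrite_prompt(history: Sequence[Tuple[str, str]], question: str) -> str:
    """Build a text prompt asking an LLM for a standalone version of `question`."""
    lines: List[str] = [
        "Rewrite the final user question so that it can be understood "
        "without the conversation.",
        "Resolve pronouns and omitted entities. Return only the rewritten question.",
        "",
        "Conversation:",
    ]
    for past_question, past_answer in history:
        lines.append(f"User: {past_question}")
        lines.append(f"Agent: {past_answer}")
    lines.append(f"User: {question}")
    lines.append("")
    lines.append("Standalone question:")
    return "\n".join(lines)


prompt = build_rewrite_prompt(
    [("Who is the CEO of Apple?", "<answer to turn 1>")],
    "its address?",
)
```

The returned string would be sent to whichever LLM serves as the rewriter, and the model's completion would then be used as the retrieval query.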

Retriever

Retrieve relevant passages from the indexed corpus

Model or implementation: Elser (Sparse), BGE-base 1.5 (Dense), or BM25 (Lexical)
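Of the three retrievers listed, BM25 is simple enough to sketch in a few lines. The version below is a minimal, self-contained Okapi BM25 with common default parameters (`k1=1.5`, `b=0.75`) and a regex tokenizer; these choices and the three-document corpus are assumptions, not the configuration used in the paper. It expects a non-empty corpus.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


class BM25:
    """Minimal Okapi BM25 over a small in-memory, non-empty corpus."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in tokens]        # term frequencies per doc
        self.doc_len = [len(t) for t in tokens]
        self.avgdl = sum(self.doc_len) / len(docs)    # average document length
        df = Counter()
        for t in tokens:
            df.update(set(t))                         # document frequency per term
        n = len(docs)
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {w: math.log(1 + (n - f + 0.5) / (f + 0.5)) for w, f in df.items()}

    def score(self, query: str, i: int) -> float:
        total = 0.0
        for w in tokenize(query):
            f = self.tf[i].get(w, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avgdl)
            total += self.idf[w] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 5) -> List[Tuple[int, float]]:
        """Return (doc index, score) pairs, best first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.tf))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


docs = ["Apple CEO Tim Cook", "Apple Park address Cupertino", "Bananas are yellow"]
index = BM25(docs)
best_doc, _ = index.top_k("apple address", k=1)[0]
```

With the bare follow-up "its address" as the query, the first document scores zero because none of its terms match, which illustrates why lexical retrieval depends on the query carrying its own context.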


Generator

Generate the final response based on retrieved passages

Model or implementation: Various LLMs (e.g., GPT-4o, Llama 3.1, Mixtral)

Modeling

Base Model: Evaluation of 9 models including Llama 3.1 (8B, 70B, 405B), Mixtral 8x22B, GPT-4o, Command R+, Qwen 2.5

Comparison to Prior Work

vs. MT-Bench: mtRAG adds retrieval and grounding requirements [not cited in paper]
vs. FaithDial: mtRAG uses active retrieval where passages change, whereas FaithDial often uses fixed context
vs. RGB: mtRAG focuses on multi-turn dependencies and context shifts rather than single-turn QA
vs. CRUD-RAG: mtRAG emphasizes human-generated conversational flow and diverse domains rather than database operations [not cited in paper]

Limitations

Human evaluation is expensive and does not scale well compared to automated metrics
The 'I Don't Know' (IDK) judge used for metrics relies on heuristics/LLMs and may have imperfections
Only 110 conversations (842 tasks) total, which is smaller than some large-scale synthetic datasets
Evaluation is limited to English language tasks

Reproducibility

Code: https://github.com/ibm/mt-rag-benchmark

Benchmark data (conversations and corpora) is available at https://github.com/ibm/mt-rag-benchmark. The specific prompts used for the 'LLM-as-a-judge' metrics are described in the paper.

📊 Experiments & Results

Evaluation Setup

RAG pipeline evaluation across 4 domains (Finance, Govt, Tech, Wikipedia) with varying retrieval strategies

Benchmarks:

mtRAG (Multi-turn RAG) [New]
mtRAG-S (Synthetic Multi-turn RAG) [New]

Metrics:

Recall@K
nDCG@K
RB_alg (Reference-Based algorithmic score)
RB_llm (Reference-Based LLM judge)
Faithfulness (Reference-less LLM judge)

Statistical methodology: Not explicitly reported in the paper
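The two retrieval metrics can be computed as sketched below. The sketch assumes binary relevance (a passage is either relevant or not) and a ranked list without duplicate IDs; whether the paper uses graded relevance for nDCG is not stated in this summary. The passage IDs are made up.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG@k with binary gains and a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, doc_id in enumerate(ranked[:k])
        if doc_id in relevant
    )
    # Ideal ranking places all relevant passages at the top.
    ideal = sum(1.0 / math.log2(p + 2) for p in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


# Hypothetical ranking: one of two relevant passages is retrieved, at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
r = recall_at_k(ranked, relevant, k=3)  # 1 of 2 relevant passages found
n = ndcg_at_k(ranked, relevant, k=3)
```

Recall@K ignores where in the top K a relevant passage lands, whereas nDCG@K rewards placing it earlier, so the two metrics can disagree about which retriever is better.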

Key Results

Retrieval experiments demonstrate the necessity of query rewriting for multi-turn conversations and show significant performance drops in later turns.

Benchmark	Metric	Baseline	This Paper	Δ
mtRAG	Recall@5	73.2	83.6	+10.4
mtRAG	Recall@5	72.4	83.6	+11.2
mtRAG	Recall@5 (Elser+Rewrite)	87.0	82.5	-4.5
mtRAG	Recall@5 (Elser+Rewrite)	84.3	82.2	-2.1

Generation experiments show that model performance degrades when moving from perfect context (Reference) to noisy retrieval (Full RAG).

Benchmark	Metric	Baseline	This Paper	Δ
mtRAG	RB_llm	4.15	3.84	-0.31
mtRAG	RB_llm	4.36	3.90	-0.46

Experiment Figures

Figure: Performance of LLMs (RB_alg score) broken down by Answerability, Turn Position, and Domain.
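A breakdown like the one in this figure amounts to grouping per-task scores by one attribute and averaging. The sketch below assumes each task is a dictionary with hypothetical keys (`answerability`, `turn`, `score`); the records are placeholder values chosen only to exercise the function and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def mean_by(
    records: Iterable[Mapping[str, object]],
    group_key: str,
    score_key: str = "score",
) -> Dict[str, float]:
    """Average `score_key` within each distinct value of `group_key`."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for record in records:
        group = str(record[group_key])
        sums[group] += float(record[score_key])
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Placeholder records (not benchmark results).
records = [
    {"answerability": "answerable", "turn": "first", "score": 1.0},
    {"answerability": "answerable", "turn": "later", "score": 0.0},
    {"answerability": "unanswerable", "turn": "later", "score": 0.0},
]

by_answerability = mean_by(records, "answerability")
by_turn = mean_by(records, "turn")
```

The same function covers all three breakdowns by changing `group_key`, provided each task record carries the corresponding attribute.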

Main Takeaways

Models struggle significantly with unanswerable questions; while GPT-4o and Llama 3.1 405B handle them better than smaller models, they still show a large performance gap compared to answerable questions.
Query rewriting is essential for multi-turn RAG; using only the last turn results in poor retrieval performance due to missing context (e.g., coreferences).
Performance degrades in later turns of the conversation for both retrieval and generation, highlighting the difficulty of maintaining context over time.
Synthetic data (mtRAG-S) can complement human data, but human evaluation reveals nuances in 'faithfulness' and 'appropriateness' that automated metrics might miss.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG pipelines (Retrieval, Generation)
Familiarity with standard IR metrics (Recall, nDCG)
Knowledge of LLM evaluation metrics (ROUGE, LLM-as-a-judge)

Key Terms

Active Retrieval: A setting where the system must perform new retrieval operations for subsequent turns in a conversation, rather than relying on a single initial context

Non-standalone question: A question that cannot be understood without the conversation history (e.g., 'Why did he do that?')

Query Rewriting: The process of transforming a non-standalone user query into a self-contained search query by incorporating context from previous turns

FANC: The four quality criteria used for reference answers in this benchmark: Faithfulness, Appropriateness, Naturalness, and Completeness

Elser (ELSER, Elastic Learned Sparse EncodeR): A learned sparse retrieval model from Elastic, used for semantic search in Elasticsearch

Answerability: Categorization of whether a question can be answered fully, partially, or not at all based on the available documents

Factoid: A question type asking for specific facts or entities

Unanswerable: Questions for which the provided corpus does not contain the necessary information, requiring the model to refuse to answer
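To make the "refuse to answer" requirement concrete, the sketch below flags refusal-style responses with a few regular expressions. This is a deliberately crude stand-in and not the benchmark's IDK judge: the pattern list, function names, and the 0/1 scoring rule are all assumptions for illustration.

```python
import re
from typing import List

# Hypothetical refusal patterns; a real judge would need far broader coverage.
IDK_PATTERNS: List[str] = [
    r"\bi (do not|don't|dont) know\b",
    r"\bcannot (answer|find)\b",
    r"\bno (relevant )?information\b",
    r"\bnot (mentioned|provided|available)\b",
]


def looks_like_idk(response: str) -> bool:
    """Return True if the response reads like a refusal to answer."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(re.search(pattern, text) for pattern in IDK_PATTERNS)


def score_unanswerable(response: str) -> float:
    """Score a response to an unanswerable question: 1.0 for a refusal, else 0.0."""
    return 1.0 if looks_like_idk(response) else 0.0


refusal = "I don't know; the documents do not cover this."
attempt = "Apple Park is in Cupertino."
```

Keyword matching of this kind misses paraphrased refusals and can misfire on answers that merely quote such phrases, which is one reason a judge of this type may be imperfect.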