Ragnarök: A reusable RAG framework and baselines for TREC 2024 retrieval-augmented generation track

📝 Paper Summary

Modularized RAG pipeline
RAG Evaluation and Benchmarking

The authors introduce Ragnarök, a standardized open-source framework and battle arena for evaluating Retrieval-Augmented Generation systems, alongside a curated MS MARCO V2.1 dataset and new non-factoid topic sets for the TREC 2024 RAG Track.

Core Problem

Existing RAG systems are often proprietary, hard to reproduce, or lack standardized evaluation frameworks, while current datasets (like Wikipedia-based ones) are too small or contain excessive factoid queries that LLMs can memorize.

Why it matters:

Lack of standardization hinders large-scale implementation and fair comparison of academic RAG research
Current benchmarks often rely on short-form answers or limited corpora, failing to test the complexity required for real-world applications like Bing Search or Gemini
Proprietary nature of industrial systems prevents the community from analyzing or building upon state-of-the-art baselines

Concrete Example: A user asks 'what inspired pink floyd's the wall?'. Without a standardized framework, comparing how a BM25+GPT-4o pipeline differs from a RankZephyr+Command R+ pipeline in citation quality and answer detail is difficult due to varying input/output formats and retrieval corpora.

Key Novelty

Ragnarök Framework & TREC 2024 RAG Track

Establish a reusable, end-to-end RAG framework (Ragnarök) that standardizes Retrieval and Augmented Generation modules with sentence-level citations
Release MS MARCO V2.1, a deduplicated and segmented version of the massive web corpus, designed specifically for RAG rather than just passage ranking
Introduce a 'Chatbot Arena' style evaluation for RAG, where human annotators blind-test pairwise system outputs to determine win rates

Architecture

The high-level workflow of the Ragnarök framework.

Evaluation Highlights

Shrank the MS MARCO V2 document collection by 8.35% by removing near-duplicates detected with Locality Sensitive Hashing (LSH)
Created MS MARCO V2.1 Segment Collection containing over 113 million text segments using sliding window chunking
Curated 120 'TREC-RAGgy' topics from past Deep Learning tracks, focusing on long-form answers where 65% of queries start with 'what' or 'how'

Breakthrough Assessment

9/10

Foundational work for the TREC 2024 RAG Track. While not a new model architecture per se, it establishes the standard infrastructure, datasets, and evaluation methodology for the field moving forward.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering using Retrieval-Augmented Generation over a large web corpus

Inputs: User topic (query)

Outputs: JSON object containing an ordered list of references (retrieved segments) and a generated answer with sentence-level citations mapping to those references
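
As a rough illustration of the output described above, the following sketch assembles a result object and checks that every citation is a valid zero-based index into the reference list. The field names (`topic_id`, `topic`, `references`, `answer`, `text`, `citations`), the helper `build_output`, and the placeholder segment IDs are assumptions made for this example; they are not taken from the paper's schema.

```python
import json

def build_output(topic_id, topic, references, answer_sentences):
    """Assemble a RAG result: ordered references plus cited answer sentences.

    All field names here are hypothetical, chosen only for illustration.
    """
    for sentence in answer_sentences:
        for idx in sentence["citations"]:
            # Citations are zero-based positions in the references list.
            if not 0 <= idx < len(references):
                raise ValueError(f"citation {idx} is out of range")
    return {
        "topic_id": topic_id,
        "topic": topic,
        "references": references,
        "answer": answer_sentences,
    }

result = build_output(
    topic_id="example-1",
    topic="what inspired pink floyd's the wall?",
    references=["segment-id-a", "segment-id-b"],  # placeholder segment IDs
    answer_sentences=[
        {"text": "First sentence of the answer.", "citations": [0]},
        {"text": "Second sentence of the answer.", "citations": [0, 1]},
    ],
)
print(json.dumps(result, indent=2))
```

Validating citation indices at assembly time is one way to guarantee that every cited sentence can be traced back to a retrieved segment.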

Pipeline Flow

Input Processing: User Topic
Retrieval & Selection: Retrieval Module (BM25 + Reranker)
Generation: Augmented Generation Module (LLM with citation support)

System Modules

Retrieval Module

Fetch and rank relevant text segments from the corpus

Model or implementation: BM25 (initial retrieval) + RankZephyr (reranking)

Augmented Generation Module

Synthesize an answer using retrieved context

Model or implementation: GPT-4o or Command R+
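
The two modules can be pictured as a chain of three interchangeable callables: retrieve, rerank, generate. The sketch below is a minimal illustration under stated assumptions: `run_pipeline`, the toy corpus, and the `toy_*` functions are hypothetical stand-ins (simple term overlap instead of BM25, an overlap-density heuristic instead of RankZephyr, a stub instead of an LLM), not the framework's API.

```python
CORPUS = [
    ("s0", "rag systems use retrieval to ground answers"),
    ("s1", "retrieval ranks segments"),
    ("s2", "unrelated text about cooking"),
]

def terms(text):
    return set(text.lower().split())

def toy_retrieve(topic, k):
    """Stand-in for first-stage retrieval: rank by number of shared terms."""
    q = terms(topic)
    scored = [(len(q & terms(text)), seg_id, text) for seg_id, text in CORPUS]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: -s[0])
    return [(seg_id, text) for _, seg_id, text in scored[:k]]

def toy_rerank(topic, candidates):
    """Stand-in for a reranker: prefer segments densest in query terms."""
    q = terms(topic)
    return sorted(candidates,
                  key=lambda c: -len(q & terms(c[1])) / len(terms(c[1])))

def toy_generate(topic, segments):
    """Stand-in for the LLM: returns a stub answer citing every segment."""
    return {
        "references": [seg_id for seg_id, _ in segments],
        "answer": [{"text": "Stub answer.",
                    "citations": list(range(len(segments)))}],
    }

def run_pipeline(topic, retrieve, rerank, generate, k_initial=100, k_final=20):
    candidates = retrieve(topic, k_initial)       # retrieval module, stage 1
    top = rerank(topic, candidates)[:k_final]     # retrieval module, stage 2
    return generate(topic, top)                   # augmented generation module

out = run_pipeline("rag retrieval", toy_retrieve, toy_rerank, toy_generate,
                   k_final=1)
print(out["references"])  # -> ['s1']
```

Passing the stages as arguments mirrors the modular design: any retriever, reranker, or generator can be swapped in as long as it respects the same inputs and outputs.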

Novel Architectural Elements

Standardized I/O definition enforcing sentence-level citations mapped to zero-based indexed references in a JSON schema
Integration of a 'Battle Arena' WebUI for blind pairwise human evaluation of full RAG pipelines

Modeling

Base Model: Modular framework supporting multiple models; Baselines include GPT-4o and Cohere Command R+

Training Method: Inference-only framework (Ragnarök itself is a harness; individual models may be pre-trained)

Adaptation: None (Prompt engineering only for baselines)

Trainable Parameters: None (for the framework itself)

Training Data:

MS MARCO V2.1 Document Collection (Deduplicated via LSH)
MS MARCO V2.1 Segment Collection (Sliding window: 10-sentence window, 5-sentence stride)
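
A minimal sketch of the sliding-window segmentation, assuming the document has already been split into sentences. The function name `sliding_window_segments` and the handling of the final, shorter segment are assumptions for illustration; the paper's exact tail behavior and sentence splitter are not specified here.

```python
def sliding_window_segments(sentences, window=10, stride=5):
    """Chunk a list of sentences into overlapping segments.

    Consecutive segments share (window - stride) sentences. The last
    segment may be shorter than `window` (an assumption of this sketch).
    """
    if not sentences:
        return []
    segments = []
    start = 0
    while True:
        segments.append(sentences[start:start + window])
        if start + window >= len(sentences):
            break  # this window already reached the end of the document
        start += stride
    return segments

sents = [f"s{i}" for i in range(23)]
segs = sliding_window_segments(sents)
print([len(s) for s in segs])  # -> [10, 10, 10, 8]
```

With a 10-sentence window and a 5-sentence stride, each sentence (away from the document edges) lands in two segments, which reduces the chance that an answer is cut at a segment boundary.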

Key Hyperparameters:

retrieval_top_k_initial: 100
retrieval_top_k_final: 20
bm25_k1: 0.9
bm25_b: 0.4
segment_window_size: 10 sentences
segment_stride: 5 sentences
reranker_passes: 3 (RankZephyr rho variant)
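
To show where `bm25_k1` and `bm25_b` enter the scoring, here is a small self-contained BM25 scorer. It is a sketch under assumptions: whitespace tokenization, and one common IDF variant, log(1 + (N - n + 0.5) / (n + 0.5)); a production search library may differ in both respects. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each document against the query with BM25.

    k1 controls term-frequency saturation; b controls how strongly
    scores are normalized by document length.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for t in query.lower().split():
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["rag retrieval retrieval", "cooking recipes", "retrieval"]
scores = bm25_scores("retrieval", docs)
# Keep the highest-scoring documents (the baseline keeps 100, then 20).
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking)
```

A lower `b` (0.4 here) weakens length normalization, so longer segments are penalized less than they would be with a value closer to 1.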

Compute: Not reported in the paper

Comparison to Prior Work

vs. LangChain/LlamaIndex: Ragnarök is specifically designed for standardized benchmarking and research reproducibility with integrated retrieval/reranking metrics, whereas others focus on application building
vs. FlashRAG: Ragnarök includes a WebUI for 'Battle Arena' human evaluation and enforces specific sentence-level citation I/O standards
vs. ALCE [not cited in paper]: Ragnarök focuses on end-to-end TREC-style evaluation on massive web corpora (MS MARCO V2.1), whereas ALCE focuses on citation quality on smaller datasets like ASQA

Limitations

Baselines rely on closed-source commercial APIs (GPT-4o, Command R+), limiting transparency
Human evaluation is expensive and slow compared to automated metrics
The 'Researchy' topic set lacks relevance judgments, relying on heuristic diversity sampling
Current baselines do not yet include a wide variety of open-source LLMs

Reproducibility

Code: https://github.com/castorini/ragnarok

Highly reproducible. Code is publicly available at https://github.com/castorini/ragnarok. The paper releases the specific MS MARCO V2.1 collection scripts, topic sets (TREC-RAGgy and TREC-Researchy), and baseline configurations (BM25 parameters). It relies on proprietary APIs (OpenAI, Cohere) for the generation baselines.

📊 Experiments & Results

Evaluation Setup

Retrieval-Augmented Generation on MS MARCO V2.1

Benchmarks:

TREC-RAGgy 2024 (Long-form QA with aggregation) [New]
TREC-Researchy 2024 (Complex/Multi-faceted QA) [New]

Metrics:

Human preference (Win Rate in Arena)
Response length
Citation count
Statistical methodology: Not explicitly reported in the paper

Key Results

The paper focuses on establishing the framework and datasets rather than reporting final leaderboard scores. However, they provide corpus statistics and qualitative baseline comparisons.

Benchmark	Metric	Baseline	This Paper	Δ
MS MARCO V2.1	Document Count	11961528	10962355	-999173
MS MARCO V2.1	Segment Count	0	113520750	+113520750
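
The corpus figures can be cross-checked with a few lines of arithmetic. This is only a consistency check on the numbers reported in this summary; the segments-per-document ratio is derived here and is not a figure quoted from the paper.

```python
docs_v2 = 11_961_528        # baseline document count (before deduplication)
docs_v21 = 10_962_355       # MS MARCO V2.1 document count
segments_v21 = 113_520_750  # MS MARCO V2.1 segment count

removed = docs_v2 - docs_v21
pct_removed = 100 * removed / docs_v2
segments_per_doc = segments_v21 / docs_v21  # derived, not from the paper

print(removed)                 # -> 999173
print(round(pct_removed, 2))   # -> 8.35
print(round(segments_per_doc, 1))
```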

Experiment Figures

Screenshot of the Ragnarök WebUI 'Battle Arena' comparison.

Main Takeaways

Qualitative observation: GPT-4o baselines tend to produce longer, more detailed answers than Command R+, though Command R+ cites more segments.
The MS MARCO V2.1 deduplication strategy (MinHash LSH) successfully reduced the corpus size by 8.35% while retaining representative documents.
The framework supports a 'blinded' evaluation mode where users vote on RAG outputs without knowing the underlying system, enabling fair 'Battle Arena' comparisons.
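
The blinded pairwise comparison and the resulting win rates can be sketched as follows. Everything here is hypothetical: the pipeline names, the vote records, the `blind_pair` and `win_rates` helpers, and the choice to count a tie as half a win for each side are assumptions for illustration, not the paper's methodology or results.

```python
import random
from collections import Counter

def blind_pair(name_x, output_x, name_y, output_y, rng):
    """Randomize left/right placement so annotators cannot infer the system."""
    pair = [(name_x, output_x), (name_y, output_y)]
    rng.shuffle(pair)
    return pair

def win_rates(votes):
    """votes: iterable of (system_a, system_b, winner).

    winner is one of the two system names or the string 'tie'.
    A tie counts as half a win for each side (an assumption of this sketch).
    """
    wins, battles = Counter(), Counter()
    for a, b, winner in votes:
        battles[a] += 1
        battles[b] += 1
        if winner == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1
    return {system: wins[system] / battles[system] for system in battles}

shown = blind_pair("pipeline_x", "answer X", "pipeline_y", "answer Y",
                   random.Random(0))

# Made-up votes, purely to exercise the function.
votes = [
    ("pipeline_x", "pipeline_y", "pipeline_x"),
    ("pipeline_x", "pipeline_y", "pipeline_y"),
    ("pipeline_x", "pipeline_y", "pipeline_x"),
    ("pipeline_x", "pipeline_y", "tie"),
]
rates = win_rates(votes)
print(rates)  # -> {'pipeline_x': 0.625, 'pipeline_y': 0.375}
```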

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG) pipelines
Familiarity with IR metrics and datasets (MS MARCO, TREC)
Basic knowledge of LLM-based evaluation (LLM-as-a-judge)

Key Terms

RAG: Retrieval-Augmented Generation—systems that improve LLM generation by retrieving relevant external data to ground the answer

MS MARCO V2.1: A curated, deduplicated version of the Microsoft Machine Reading Comprehension dataset, specifically segmented for RAG tasks

BM25: Best Matching 25, a probabilistic ranking function that scores documents by query-term frequency, inverse document frequency, and document-length normalization

LSH: Locality Sensitive Hashing—an algorithmic technique that hashes similar input items into the same buckets with high probability, used here for deduplication

MinHash: A technique used to estimate the similarity between two sets (Jaccard similarity), used within LSH for document deduplication

Shingles: Consecutive sub-sequences of tokens (e.g., 9-gram) used to represent documents for similarity estimation
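
The three terms above (LSH, MinHash, shingles) fit together as sketched below: documents become sets of token shingles, MinHash compresses each set into a short signature whose agreement rate estimates Jaccard similarity, and LSH banding buckets signatures so that only likely near-duplicates are compared. This is a generic textbook sketch, not the paper's deduplication code; the function names, the SHA-1 based hash family, `num_perm=64`, and `bands=16` are all assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, n=9):
    """Set of consecutive n-token shingles (whitespace tokens)."""
    tokens = text.lower().split()
    if len(tokens) <= n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def _hash(seed, shingle):
    # One hash function per seed, derived by prefixing the seed.
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(shingle_set, num_perm=64):
    """For each hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, s) for s in shingle_set)
            for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_candidate_pairs(signatures, bands=16):
    """Pairs of documents that share at least one identical signature band."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

docs = {
    "d1": "a b c d e f g h",
    "d2": "a b c d e f g h",   # exact duplicate of d1
    "d3": "x y z w v u t s",   # shares no shingles with d1
}
sigs = {k: minhash_signature(shingles(v, n=3)) for k, v in docs.items()}
pairs = lsh_candidate_pairs(sigs)
print(("d1", "d2") in pairs)  # -> True
```

Banding trades precision for recall: more bands with fewer rows each makes it easier for moderately similar documents to collide in at least one bucket.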

RankZephyr: A specific open-source Large Language Model fine-tuned for the task of reranking retrieval results

LLM-as-a-judge: An evaluation method where a strong LLM (like GPT-4) is used to score or compare the quality of outputs from other models

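Sliding window: A chunking technique where a fixed-size window moves over text in steps of a fixed stride; when the stride is smaller than the window, consecutive segments overlap (here, a 10-sentence window with a 5-sentence stride gives a 5-sentence overlap)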

Factoid queries: Questions that have a short, concise, factual answer (e.g., 'What is the capital of France?'), which this paper aims to avoid in favor of complex queries