
Democratizing and Modernizing Information Access: From Open Rerankers to Scalable RAG Evaluation

R Pradeep
University of Waterloo
2025
RAG Benchmark QA

📝 Paper Summary

Modularized RAG pipeline
This thesis develops open-source reranking components (RankZephyr) and standardized RAG benchmarking frameworks (Ragnarök, AutoNuggetizer) to democratize information access and enable scalable, reliable evaluation of generative systems.
Core Problem
Modern information access relies on proprietary black-box models that hinder reproducibility, lacks a standardized benchmarking ecosystem, and has no scalable way to evaluate generative RAG answers.
Why it matters:
  • Reliance on closed proprietary models (like GPT-4) for reranking creates barriers to innovation and concentrates power in a few tech companies
  • The absence of a unified framework for RAG makes comparing different systems ad-hoc and unreliable, slowing research progress
  • Traditional IR metrics (like nDCG) cannot measure the factual accuracy or completeness of long-form generative answers produced by RAG systems
Concrete Example: In RAG evaluation, a system might generate a fluent but factually incorrect answer. Standard lexical overlap metrics (like ROUGE) might score it highly if it shares words with the reference, while failing to detect that the key information nugget is missing or hallucinated.
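The failure mode in the concrete example can be sketched in a few lines of Python. This is a toy illustration, not the thesis's evaluation code: `unigram_f1` is a simplified stand-in for a ROUGE-1-style overlap score, `nugget_recall` uses plain substring matching where AutoNuggetizer uses an LLM, and the sentences and the nugget are invented for the demonstration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def nugget_recall(answer: str, nuggets: list[str]) -> float:
    """Fraction of nuggets found in the answer (toy substring match;
    an assumption standing in for LLM-based nugget assignment)."""
    text = answer.lower()
    hits = sum(1 for nugget in nuggets if nugget.lower() in text)
    return hits / len(nuggets) if nuggets else 0.0


# Invented example data: the wrong answer differs from the reference by one token.
reference = "the eiffel tower was completed in 1889 in paris"
wrong_answer = "the eiffel tower was completed in 1925 in paris"
nuggets = ["completed in 1889"]

overlap_score = unigram_f1(wrong_answer, reference)    # high: 8 of 9 tokens match
recall_score = nugget_recall(wrong_answer, nuggets)    # zero: the key fact is absent
```

The overlap score stays high because only one token differs, while the nugget check drops to zero because the single key fact is wrong.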
Key Novelty
Open-Source Reranking & Automated RAG Evaluation (RankZephyr, Ragnarök, AutoNuggetizer)
  • Proposes Expando-Mono-Duo, a multi-stage ranking design pattern using T5 that balances effectiveness and cost through progressive candidate reduction
  • Introduces RankZephyr, an open-source listwise reranker distilled from proprietary LLMs to match GPT-4 effectiveness without black-box dependency
  • Develops AutoNuggetizer, a framework that automates nugget-based evaluation using LLMs to measure information recall in RAG answers scalably
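The idea of progressive candidate reduction in a multi-stage ranker can be sketched as a small cascade: a cheaper pointwise stage trims the candidate list, and a costlier pairwise stage reorders only the survivors. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the cutoffs, and the toy scorers `term_overlap` and `prefer_denser` are all hypothetical placeholders for the T5-based models.

```python
from itertools import permutations
from typing import Callable

PointwiseScorer = Callable[[str, str], float]       # (query, doc) -> score
PairwiseScorer = Callable[[str, str, str], float]   # (query, doc_a, doc_b) -> preference for doc_a


def pointwise_stage(query: str, docs: list[str], score: PointwiseScorer, keep: int) -> list[str]:
    """Score each document independently and keep the top `keep`."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked[:keep]


def pairwise_stage(query: str, docs: list[str], prefer: PairwiseScorer, keep: int) -> list[str]:
    """Compare every ordered pair (quadratic cost) and rank by summed preference.
    Assumes document strings are unique."""
    totals = {d: 0.0 for d in docs}
    for a, b in permutations(docs, 2):
        totals[a] += prefer(query, a, b)
    ranked = sorted(docs, key=lambda d: totals[d], reverse=True)
    return ranked[:keep]


# Toy scorers (hypothetical placeholders for learned models).
def term_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def prefer_denser(query: str, a: str, b: str) -> float:
    density_a = term_overlap(query, a) / max(len(a.split()), 1)
    density_b = term_overlap(query, b) / max(len(b.split()), 1)
    return 1.0 if density_a > density_b else 0.0


def cascade(query: str, docs: list[str]) -> list[str]:
    stage1 = pointwise_stage(query, docs, term_overlap, keep=3)  # 5 -> 3 candidates
    return pairwise_stage(query, stage1, prefer_denser, keep=2)  # 3 -> 2 candidates


docs = [
    "open source reranker distilled from a larger model",
    "a reranker",
    "open source software licenses explained in detail for beginners",
    "cooking pasta at home",
    "open source reranker",
]
final = cascade("open source reranker", docs)
```

Because the pairwise stage compares every ordered pair, its cost grows quadratically with the number of candidates, which is why it only sees the short list left by the earlier stage.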
Evaluation Highlights
  • RankZephyr (7B) matches or exceeds GPT-4 performance on TREC Deep Learning Track passage ranking tasks in zero-shot settings
  • AutoNuggetizer automates evaluation for TREC 2024 RAG Track, showing high correlation with human judgments for information recall
  • Generative retrieval (DSI) fails to scale effectively to large corpora (8.8M passages), lagging behind dual-encoder baselines
Breakthrough Assessment
9/10
Provides critical open-source infrastructure (RankZephyr, Ragnarök) and a verified evaluation methodology (AutoNuggetizer) adopted by TREC 2024, fundamentally enabling reproducible RAG research.