ARAGOG: Advanced RAG Output Grading

📝 Paper Summary

Modularized RAG pipeline

This study experimentally evaluates multiple advanced RAG techniques, finding that Sentence Window Retrieval and HyDE offer the best retrieval precision, while Maximal Marginal Relevance and Multi-query strategies often fail to outperform a naive baseline.

Core Problem

Current RAG literature focuses on systematic reviews or SOTA comparisons, lacking comprehensive experimental benchmarks that isolate the impact of specific retrieval techniques like HyDE, MMR, and reranking on precision and answer similarity.

Why it matters:

Developers lack empirical guidance on which RAG components actually improve retrieval quality versus just adding complexity.
Systematic comparisons are needed to understand trade-offs between retrieval precision (finding the right context) and answer similarity (generating the right text).
Some popular techniques (like Multi-query) may degrade performance in specific setups, but this is rarely documented in broad reviews.

Concrete Example: When a naive RAG system retrieves irrelevant chunks for a technical question, the LLM hallucinates an answer. The study tests whether adding a reranker (such as Cohere Rerank) or changing the chunking strategy (Sentence Window) fixes this, and finds that Sentence Window significantly improves precision while Multi-query reduces it.

Key Novelty

Head-to-head empirical benchmarking of advanced RAG modules

Systematically compares distinct RAG strategies (Naive, Sentence Window, HyDE, Multi-query, MMR, Reranking) on a controlled dataset of AI papers.
Decouples evaluation into Retrieval Precision (did we get the right text?) and Answer Similarity (did we write the right answer?) to diagnose component-level failures.
Demonstrates that 'advanced' techniques like Multi-query do not universally improve performance and can degrade precision compared to simpler baselines.

Architecture

Standard RAG workflow illustrating the baseline system

Evaluation Highlights

Sentence Window Retrieval achieves the highest median retrieval precision (~0.85-0.90 range), significantly outperforming Naive RAG.
HyDE (Hypothetical Document Embeddings) combined with LLM Rerank outperforms Naive RAG in retrieval precision by a statistically significant margin.
Multi-query approaches underperformed Naive RAG in retrieval precision, contradicting common assumptions about query expansion benefits.

Breakthrough Assessment

4/10

Valuable exploratory analysis and benchmarking of existing techniques rather than a new architectural breakthrough. Provides useful negative results (MMR/Multi-query performance) for practitioners.

⚙️ Technical Details

Problem Definition

Setting: Question Answering over a specific domain corpus (AI research papers) with noise documents included.

Inputs: Natural language question q relating to specific technical details in AI papers.

Outputs: Retrieved context chunks and a generated answer.

Pipeline Flow

Input Query Processing (Naive, HyDE, or Multi-query)
Retrieval (Standard Vector Search or MMR)
Post-Retrieval Processing (Reranking via Cohere or LLM)
Generation (GPT-3.5-turbo)
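The four stages above can be read as a chain of swappable functions. The following is a minimal sketch under stated assumptions, not the paper's code: every name in it is hypothetical, and the word-overlap scorer and the echo generator are toy stand-ins for vector search and GPT-3.5-turbo.

```python
from typing import Callable, List, Tuple

# Hypothetical stage signatures; none of these names come from the paper.
QueryTransform = Callable[[str], List[str]]        # question -> search queries
Retriever = Callable[[str, int], List[str]]        # (query, k) -> candidate chunks
Reranker = Callable[[str, List[str]], List[str]]   # (question, chunks) -> reordered
Generator = Callable[[str, List[str]], str]        # (question, context) -> answer


def run_pipeline(question: str, transform: QueryTransform, retrieve: Retriever,
                 rerank: Reranker, generate: Generator,
                 k: int = 2) -> Tuple[str, List[str]]:
    """Query processing -> retrieval -> post-retrieval -> generation."""
    candidates: List[str] = []
    for query in transform(question):
        for chunk in retrieve(query, k):
            if chunk not in candidates:   # de-duplicate across expanded queries
                candidates.append(chunk)
    context = rerank(question, candidates)[:k]
    return generate(question, context), context


# Toy components so the sketch runs without any external service.
CORPUS = [
    "BERT is pretrained with masked language modeling.",
    "DistilBERT is a distilled version of BERT.",
    "Vector search ranks chunks by embedding similarity.",
]


def overlap(query: str, chunk: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().rstrip(".").split()))


def naive_transform(question: str) -> List[str]:
    return [question]                     # the naive setting passes the query through


def overlap_retrieve(query: str, k: int) -> List[str]:
    return sorted(CORPUS, key=lambda c: overlap(query, c), reverse=True)[:k]


def overlap_rerank(question: str, chunks: List[str]) -> List[str]:
    return sorted(chunks, key=lambda c: overlap(question, c), reverse=True)


def echo_generate(question: str, context: List[str]) -> str:
    return f"Answer based on {len(context)} chunk(s)."


answer, context = run_pipeline("How is BERT pretrained", naive_transform,
                               overlap_retrieve, overlap_rerank, echo_generate)
```

Swapping `naive_transform` for a HyDE or Multi-query transform, or `overlap_rerank` for a different reranker, changes one stage while leaving the rest fixed, which is the kind of component isolation the study relies on.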

System Modules

Chunking/Indexing

Prepare database

Model or implementation: Various (TokenTextSplitter, SentenceWindowNodeParser)

Query Expansion/Transformation (Retrieval & Selection)

Modify query for better retrieval

Model or implementation: HyDE or Multi-query (LLM-based)

Retriever (Retrieval & Selection)

Fetch initial candidates

Model or implementation: Vector similarity search

Reranker (Retrieval & Selection)

Re-order retrieved chunks

Model or implementation: Cohere Rerank OR LLM Rerank

Generator

Produce final answer

Model or implementation: GPT-3.5-turbo

Novel Architectural Elements

Comparative framework integrating multiple distinct retrieval paradigms (Sentence Window, HyDE, Document Summary Index) into a single evaluation harness.

Modeling

Base Model: GPT-3.5-turbo (for generation and LLM reranking)

Compute: Not reported in the paper

Comparison to Prior Work

vs. RAGAS: This paper uses Tonic Validate (Retrieval Precision/Answer Similarity) instead of RAGAS metrics (Faithfulness/Answer Relevance), citing complexity/reliability concerns with RAGAS.
vs. General Benchmarks: Focuses on component isolation (e.g., just reranker, just chunking strategy) rather than end-to-end system performance on public leaderboards.

Limitations

Evaluation relies on GPT-3.5-turbo, which may limit the ceiling of generation quality and evaluation precision compared to GPT-4.
The Answer Similarity metric showed low correlation with Retrieval Precision in Sentence Window experiments, complicating interpretation.
Cost and latency analysis for the different techniques (especially LLM Rerank and HyDE) is mentioned qualitatively but not quantified.
The dataset is limited to 107 QA pairs from 13 papers (plus noise), which is relatively small.

Reproducibility

Code: https://github.com/predicthq/aragog

The code is publicly available and includes the dataset of 107 QA pairs and the evaluation code. Evaluation relies on the Tonic Validate platform/package.

📊 Experiments & Results

Evaluation Setup

QA over a vector database of AI papers (13 relevant + 410 noise).

Benchmarks:

Custom AI ArXiv Dataset (Technical Question Answering) [New]

Metrics:

Retrieval Precision (0-1)
Answer Similarity (0-5)
Statistical methodology: ANOVA and Tukey’s HSD (Honestly Significant Difference) tests performed on 10 runs per technique.
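To make the statistical step concrete, here is a minimal pure-Python sketch of the one-way ANOVA F-statistic, which tests whether mean scores differ across techniques. The scores are made-up illustrative values, not the paper's data, and the sketch uses 3 runs per group rather than 10 for brevity. Tukey's HSD, the post-hoc test that identifies which pairs differ, is not implemented here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over score groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual runs around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up retrieval-precision scores for three hypothetical techniques.
scores = [
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7],
]
f_stat, df_b, df_w = one_way_anova_f(scores)
print(round(f_stat, 3), df_b, df_w)  # 13.0 2 6
```

A large F relative to the F-distribution with (df_between, df_within) degrees of freedom indicates that at least one technique differs. In practice, SciPy's `scipy.stats.f_oneway` and `scipy.stats.tukey_hsd` cover both steps, including p-values.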

Key Results

Comparative analysis of retrieval precision across different RAG techniques.
Benchmark	Metric	Baseline	This Paper	Δ
Custom AI ArXiv Dataset	Retrieval Precision	0.60	0.90	+0.30
Custom AI ArXiv Dataset	Retrieval Precision	0.60	0.75	+0.15
Custom AI ArXiv Dataset	Retrieval Precision	0.60	0.40	-0.20

Experiment Figures

Boxplots of Retrieval Precision scores across all tested RAG techniques.

Boxplots of Answer Similarity scores across all tested RAG techniques.

Main Takeaways

Sentence Window Retrieval consistently achieved the highest Retrieval Precision, surpassing Document Summary Index and Naive RAG.
HyDE and LLM Reranking provided statistically significant improvements over Naive RAG, though at higher latency/cost.
Maximal Marginal Relevance (MMR) and Cohere Rerank did not show statistically significant improvements over Naive RAG in this specific setup.
High retrieval precision did not always correlate with high answer similarity, particularly for Sentence Window Retrieval, suggesting potential disconnects between retrieved context format and generation capabilities.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation) workflows
Familiarity with vector databases and embeddings
Basic knowledge of LLM evaluation metrics (Precision, Similarity)

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then generates an answer grounded in them.

HyDE: Hypothetical Document Embeddings, a technique that generates a hypothetical answer to a query, embeds that answer, and uses the embedding to search for real documents.
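A minimal sketch of the HyDE idea, assuming a toy bag-of-words embedding and a hard-coded stand-in for the LLM call. Both `embed` and `fake_llm_answer` are hypothetical helpers, and the two-document corpus is contrived to show the effect.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query_text, corpus, k=1):
    """Rank documents by cosine similarity to the embedded query text."""
    query_vec = embed(query_text)
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def fake_llm_answer(question):
    """Hypothetical stand-in for an LLM call that drafts a plausible answer."""
    return "Dropout randomly zeroes activations during training to reduce overfitting."


def hyde_search(question, corpus, k=1):
    # Embed the drafted answer instead of the question itself.
    return search(fake_llm_answer(question), corpus, k)


corpus = [
    "Dropout reduces overfitting by randomly zeroing activations.",
    "What is the best question to ask about regularization?",
]
question = "What is dropout?"

direct_top = search(question, corpus)[0]
hyde_top = hyde_search(question, corpus)[0]
```

In this contrived corpus the raw question shares more words with the second document ("what", "is"), so `direct_top` is the irrelevant one, while the drafted answer shares vocabulary with the first document and `hyde_top` retrieves it. With real embedding models the gap is less clear-cut.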

MMR: Maximal Marginal Relevance—a retrieval method that selects documents based on a trade-off between relevance to the query and diversity among the selected documents.
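The relevance/diversity trade-off can be sketched as a greedy selection loop. This is a minimal illustration with made-up similarity values; the parameter name `lam` for the trade-off weight is an assumption of this sketch.

```python
def mmr_select(query_sim, doc_sim, k, lam=0.5):
    """Greedy Maximal Marginal Relevance.

    query_sim[i]  : similarity of document i to the query
    doc_sim[i][j] : similarity between documents i and j
    lam           : 1.0 means pure relevance, 0.0 means pure diversity
    Returns the indices of the selected documents, in selection order.
    """
    selected = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalize similarity to the closest already-selected document.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up similarities: documents 0 and 1 are near-duplicates.
query_sim = [0.9, 0.85, 0.5]
doc_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

print(mmr_select(query_sim, doc_sim, k=2, lam=0.5))  # [0, 2]
print(mmr_select(query_sim, doc_sim, k=2, lam=1.0))  # [0, 1]
```

With `lam=0.5` the near-duplicate document 1 is skipped in favor of the less relevant but more distinct document 2, which illustrates how MMR can lower precision when the redundant chunks are in fact the relevant ones.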

Sentence Window Retrieval: Retrieving a single relevant sentence based on similarity, then expanding the context window to include surrounding sentences for the generation step.
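A minimal sketch of the match-then-expand step, assuming the document is already split into sentences and using a toy word-overlap scorer in place of embedding similarity. This is not the SentenceWindowNodeParser implementation; the function and parameter names are hypothetical.

```python
def overlap(query, sentence):
    """Toy similarity: number of shared lowercase words."""
    return len(set(query.lower().split())
               & set(sentence.lower().rstrip(".").split()))


def sentence_window_retrieve(query, sentences, score=overlap, window=1):
    """Match on a single sentence, then return it with its neighbours.

    Returns (matched_sentence, expanded_context).
    """
    best = max(range(len(sentences)), key=lambda i: score(query, sentences[i]))
    start = max(0, best - window)                    # clamp at document start
    end = min(len(sentences), best + window + 1)     # clamp at document end
    return sentences[best], " ".join(sentences[start:end])


sentences = [
    "Transformers use attention.",
    "Attention weights are computed with softmax.",
    "The softmax output sums to one.",
    "Training uses Adam.",
]
matched, context = sentence_window_retrieve(
    "how are attention weights computed", sentences, window=1)
```

Matching happens at sentence granularity, which keeps the similarity signal sharp, while the generator receives the wider `context` string containing the neighbouring sentences.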

Cross-encoder: A model that processes the query and document simultaneously (jointly) to output a relevance score, often used for reranking.

LLM Rerank: Using a Large Language Model to evaluate and re-order retrieved documents based on relevance to the query.

Naive RAG: A standard baseline using fixed-size text chunking and cosine similarity retrieval without additional optimization.
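Fixed-size chunking can be sketched as a sliding window over tokens. The version below splits on whitespace rather than using a model tokenizer, and the `chunk_size` and `overlap` values are illustrative assumptions rather than the settings used in the study.

```python
def split_tokens(text, chunk_size, overlap=0):
    """Split text into chunks of at most `chunk_size` whitespace tokens.

    Consecutive chunks share `overlap` tokens so that content near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):   # last window reached the end
            break
    return chunks


print(split_tokens("a b c d e f g", chunk_size=3, overlap=1))
# ['a b c', 'c d e', 'e f g']
```

Each chunk would then be embedded and ranked by cosine similarity to the query embedding, with no query transformation or reranking applied.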

Cohere Rerank: A commercial cross-encoder service used to re-score and re-order retrieved documents.