RAG: Retrieval-Augmented Generation—systems that improve LLM generation by retrieving relevant external data to ground the answer
MS MARCO V2.1: A curated, deduplicated version of the Microsoft Machine Reading Comprehension dataset, specifically segmented for RAG tasks
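BM25: Best Matching 25, a probabilistic ranking function that scores each document against a query using the frequency of the query terms in the document, their inverse document frequency, and document-length normalization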
LSH: Locality-Sensitive Hashing, a technique that hashes similar input items into the same buckets with high probability, used here for deduplication
MinHash: A hashing technique that estimates the Jaccard similarity between two sets, applied within LSH for document deduplication
Shingles: Consecutive sub-sequences of tokens (e.g., 9-grams) used to represent documents for similarity estimation
RankZephyr: An open-source large language model fine-tuned to rerank retrieval results
LLM-as-a-judge: An evaluation method where a strong LLM (like GPT-4) is used to score or compare the quality of outputs from other models
Sliding window: A chunking technique where a fixed-size window moves over the text by a fixed stride (step size) to create segments; consecutive segments overlap whenever the stride is smaller than the window
Factoid queries: Questions that have a short, concise, factual answer (e.g., 'What is the capital of France?'), which this paper aims to avoid in favor of complex queries
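
To make the BM25 entry concrete, here is a minimal sketch of one common variant of the scoring function (Okapi BM25 with a Lucene-style IDF). The function name bm25_scores, the pre-tokenized list inputs, and the parameter values k1=1.2 and b=0.75 are illustrative assumptions, not settings taken from the paper.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against a tokenized `query`.

    k1 controls term-frequency saturation and b controls length
    normalization; both defaults are illustrative assumptions.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that contains more of the query terms, or rarer ones, receives a higher score; a document that contains none of them scores zero.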
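
The Shingles and MinHash entries can be illustrated together. The sketch below builds n-token shingles and estimates Jaccard similarity from MinHash signatures. The helper names, the use of seeded blake2b as the hash family, and num_hashes=128 are assumptions made for illustration; the paper's actual deduplication pipeline may differ, and the LSH bucketing step is omitted.

```python
import hashlib

def shingles(tokens, n=9):
    """Return the set of consecutive n-token shingles of a token list."""
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _seeded_hash(shingle, seed):
    # Seeded 64-bit hash; blake2b is an illustrative choice (assumption).
    data = " ".join(shingle).encode("utf-8")
    digest = hashlib.blake2b(
        data, digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(shingle_set, num_hashes=128):
    """Keep the minimum hash value per seeded hash function.

    Assumes shingle_set is non-empty.
    """
    return [
        min(_seeded_hash(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree approximates Jaccard."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

The estimate works because, for any single hash function, the probability that two sets share the same minimum equals their Jaccard similarity; more hash functions reduce the variance of the estimate. LSH would then group signatures into bands so that near-duplicates tend to collide in at least one bucket.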
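
The Sliding window entry can be sketched as follows. The function name sliding_window and the default window and stride values are hypothetical; the units could be tokens or sentences, depending on how the corpus is segmented.

```python
def sliding_window(units, window=10, stride=5):
    """Split a list of units (tokens or sentences) into overlapping segments.

    Each segment holds up to `window` units and each new segment starts
    `stride` units after the previous one, so neighbors share
    window - stride units when stride < window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not units:
        return []
    segments = []
    start = 0
    while True:
        segments.append(units[start:start + window])
        # Stop once the current window has reached the end of the input.
        if start + window >= len(units):
            break
        start += stride
    return segments
```

The last segment may be shorter than the window when the input length does not line up with the stride.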