RAG: Retrieval-Augmented Generation. AI systems that answer questions by first retrieving relevant documents and then generating an answer conditioned on them.
Closed-book: Evaluating an LLM's ability to answer questions using only its internal pre-trained knowledge, with no external documents provided.
Golden Reference: The ground-truth document that contains the correct answer, provided directly to the model to test its reasoning upper bound.
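BM25: A probabilistic ranking function from information retrieval that scores each document by how often the query terms occur in it, weighted by how rare each term is across the collection and normalized for document length.
EM: Exact Match. A metric that checks whether the prediction is identical to, or contains, the ground-truth answer.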
Faithfulness: The ability of the model to stick to the provided external context rather than hallucinating or relying on internal (potentially outdated) memory.
Time-sensitive QA: Questions where the correct answer depends on a specific timestamp (e.g., admission scores for 2023 vs. 2024).
Noise Ratio: The proportion of irrelevant documents in the context given to the model, mixed in with relevant ones to test the model's robustness.
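
The EM and Noise Ratio entries above can be made concrete with a minimal Python sketch. Everything here is an illustrative assumption rather than the evaluation code behind this glossary: the function names (`normalize`, `exact_match`, `build_context`), the normalization rules (lowercasing, dropping ASCII punctuation), the substring-level containment check, and the rounding used to turn a ratio into a document count are all hypothetical choices.

```python
import random
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, truth: str, allow_contains: bool = True) -> bool:
    """Return True if the prediction equals the ground truth after normalization.

    With allow_contains=True, a prediction that contains the ground truth as a
    substring also counts, matching the relaxed reading in the EM entry.
    """
    pred, gold = normalize(prediction), normalize(truth)
    if pred == gold:
        return True
    # Guard against an empty ground truth, which is a substring of everything.
    return allow_contains and bool(gold) and gold in pred


def build_context(relevant, irrelevant, noise_ratio, total, seed=0):
    """Mix relevant and irrelevant documents so that roughly `noise_ratio`
    of the `total` documents are irrelevant, then shuffle them."""
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be between 0 and 1")
    n_noise = round(total * noise_ratio)
    n_relevant = total - n_noise
    if n_noise > len(irrelevant) or n_relevant > len(relevant):
        raise ValueError("not enough documents to build the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    docs = rng.sample(irrelevant, n_noise) + rng.sample(relevant, n_relevant)
    rng.shuffle(docs)
    return docs


if __name__ == "__main__":
    print(exact_match("The answer is Paris.", "Paris"))
    print(exact_match("The answer is Paris.", "Paris", allow_contains=False))
    print(build_context(["r1", "r2", "r3"], ["n1", "n2", "n3"], 0.4, 5))
```

Substring containment is a deliberate simplification: it would also accept a ground truth that appears inside a longer word, so a token-level check may be preferable in practice.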