Parametric Knowledge: Information stored in the model's pre-trained weights/parameters rather than provided in the context.
Groundedness: The extent to which a model's response is derived solely from the provided retrieved documents.
Trust-Score: The proposed holistic metric averaging Grounded Refusals, Answer Correctness, and Groundedness of Citations.
DPO: Direct Preference Optimization—an algorithm for fine-tuning LLMs to align with human preferences using pairs of preferred and dispreferred outputs.
Hallucination (in RAG): Errors where the model invents information, fails to use documents, refuses when it shouldn't, or cites incorrectly.
Answerability: Whether the provided documents D contain sufficient information to answer question q.
NLI: Natural Language Inference—a task determining if a premise entails a hypothesis; used here to verify if a cited document actually supports a claim.
ASQA: Answer Summaries for Questions which are Ambiguous, a QA dataset of ambiguous factoid questions that require long-form answers.
QAMPARI: A QA benchmark requiring answers that consist of lists of entities.
ELI5: Explain Like I'm 5—a long-form QA dataset requiring detailed explanations.
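
As a minimal sketch of the Trust-Score entry above, the snippet below averages the three component scores. It assumes an unweighted arithmetic mean and that all three components are reported on the same scale; the function name, parameter names, and input values are hypothetical illustrations, not reported results.

```python
from statistics import mean


def trust_score(grounded_refusals: float,
                answer_correctness: float,
                citation_groundedness: float) -> float:
    """Return the unweighted mean of the three component scores.

    Assumption: equal weighting, with all components on the same scale
    (for example 0-100). The glossary only says the metric "averages" them.
    """
    return mean([grounded_refusals, answer_correctness, citation_groundedness])


# Hypothetical component values, chosen only to illustrate the arithmetic.
print(trust_score(60.0, 45.0, 75.0))
```

Under this assumption, a model that scores well on only one component is pulled down by the other two, which is the sense in which the metric is holistic.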