RAG: Retrieval-Augmented Generation, an approach in which an AI system answers questions by first retrieving relevant documents and then generating a response grounded in them
ContextualBench: A new evaluation framework introduced in this paper that compiles seven RAG benchmarks (such as HotpotQA and TriviaQA) under consistent settings
Hallucination: When a model generates incorrect information or information not supported by the provided context
Parametric Knowledge: Facts stored within the model's weights during pre-training, as opposed to knowledge provided in the context window
Instruction Hierarchy: Ensuring the model prioritizes system prompts over potentially malicious instructions found in user inputs or retrieved data
Agentic: Systems capable of using tools, planning, and performing multi-step actions to solve problems
FaithEval: An evaluation suite measuring how well LLMs remain faithful to the provided context under unknown, conflicting, or counterfactual scenarios
SFT: Supervised Fine-Tuning—training a model on labeled examples
Preference Learning: A training technique that aligns model outputs with human preferences, often via DPO (Direct Preference Optimization) or PPO (Proximal Policy Optimization)
ReAct: Reasoning and Acting—a prompting strategy where models generate reasoning traces and actions in an interleaved manner
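
To make the ReAct entry concrete, here is a minimal sketch of the interleaved thought/action/observation loop. It is illustrative only and not taken from the paper: the function `react_loop`, the `Action: name[argument]` text format, the `finish` action, and the scripted stand-in model are all assumptions chosen to keep the example self-contained.

```python
from typing import Callable, Dict


def react_loop(
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> str:
    """Alternate model steps (thought + action) with tool observations."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model sees the whole transcript and emits a thought plus an action.
        step = model(transcript)
        transcript += step + "\n"

        # Assumed format: the last line of each step is "Action: name[argument]".
        action_line = step.strip().splitlines()[-1]
        if not action_line.startswith("Action: "):
            break
        name, _, arg = action_line[len("Action: "):].partition("[")
        arg = arg.rstrip("]")

        if name == "finish":  # hypothetical terminal action carrying the answer
            return arg

        # Run the requested tool and feed its result back as an observation.
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        transcript += f"Observation: {observation}\n"
    return ""


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: look something up once, then finish."""
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: lookup[capital of France]"
    return "Thought: The observation answers the question.\nAction: finish[Paris]"


tools = {"lookup": lambda q: {"capital of France": "Paris"}.get(q, "unknown")}
answer = react_loop(scripted_model, tools, "What is the capital of France?")
print(answer)
```

In a real system the scripted model would be replaced by an LLM call, and the growing transcript is what lets each new reasoning step condition on earlier observations.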