RAG (Retrieval-Augmented Generation): AI systems that answer questions by first retrieving relevant documents or structured data, then generating an answer grounded in what was retrieved
Truthfulness: A composite metric defined as Perfect_rate + 0.5*Acceptable_rate - Hallucination_rate. A missing answer ('I don't know') contributes 0 and a hallucinated answer contributes -1, so wrong answers are penalized more than abstentions
Mock APIs: Simulated interfaces provided in the benchmark that mimic accessing structured Knowledge Graphs (e.g., getting a movie director)
Dynamism: Categorization of questions based on how frequently the answer changes (Real-time, Fast-changing, Slow-changing, Static)
Chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer
Auto-eval: Using an LLM (e.g., ChatGPT) as a judge to grade answers as accurate, missing, or hallucinated
Knowledge Graph (KG): A structured representation of facts (entities and relationships) used for precise data retrieval
BGE (BAAI General Embedding): A widely used dense embedding model that encodes text into vectors for retrieval
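
The Truthfulness formula above can be illustrated with a short calculation. This is a minimal sketch, not the benchmark's own scoring code: the function name `truthfulness` and the label strings ("perfect", "acceptable", "missing", "hallucination") are hypothetical names chosen for the example.

```python
from collections import Counter


def truthfulness(labels):
    """Return Perfect_rate + 0.5 * Acceptable_rate - Hallucination_rate.

    `labels` holds one grade per answer. The label strings are assumed
    for illustration; "missing" answers add nothing to the score.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)  # absent labels count as 0
    total = len(labels)
    return (
        counts["perfect"]
        + 0.5 * counts["acceptable"]
        - counts["hallucination"]
    ) / total


# Five graded answers: 2 perfect, 1 acceptable, 1 missing, 1 hallucinated.
grades = ["perfect", "perfect", "acceptable", "missing", "hallucination"]
print(truthfulness(grades))  # (2 + 0.5 - 1) / 5
```

Under these assumptions, a system that always abstains scores 0 and a system that always hallucinates scores -1, which is why declining to answer is preferable to answering incorrectly.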