KBQA: Knowledge-Based Question Answering—systems that answer questions by querying a structured database (Knowledge Base)
CheckList: A behavioral testing framework for NLP models that checks capabilities (MFT), robustness (INV), and controllability (DIR)
Exact Match (EM): A metric that counts a prediction as correct only if it matches the ground-truth answer exactly; adapted here to also accept fuzzy matches and any alias listed for the answer
MFT: Minimum Functionality Test, which checks whether the model can solve simple, narrowly scoped reasoning tasks (e.g., only set operations)
INV: Invariance Test—checks if the model's answer remains consistent despite irrelevant changes to the input (e.g., typos, paraphrasing)
DIR: Directional Expectation Test—checks if the model's output changes in expected ways when the input is modified (e.g., adding constraints)
CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps
SPARQL: The W3C-standard query language for RDF graph data, often used by traditional KBQA models to retrieve answers from a knowledge base; LLMs generate free-form text instead
Constituent Tree: A parse tree that breaks a sentence into nested grammatical phrases, used here to extract Noun Phrases (NP) or Verb Phrases (VP) as candidate answers
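
To make the adapted Exact Match entry concrete, here is a minimal Python sketch of alias-aware matching. It assumes the candidate answers are phrases already extracted from the model's free-text response (for example, the NP/VP spans from a constituent tree). The function names `normalize` and `relaxed_exact_match` and the normalization rules are illustrative assumptions, not the evaluated system's implementation, and the fuzzy-matching step is left out.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def relaxed_exact_match(candidates, gold, aliases=()):
    """Return True if any candidate equals the gold answer or one of its aliases
    after normalization. `candidates` is a hypothetical list of phrases extracted
    from the model's response."""
    accepted = {normalize(answer) for answer in (gold, *aliases)}
    return any(normalize(candidate) in accepted for candidate in candidates)


# Usage: the second candidate matches the gold answer once punctuation is removed.
print(relaxed_exact_match(["the capital", "Paris."], "Paris"))
# Usage: the candidate matches through the alias list rather than the gold string.
print(relaxed_exact_match(["New York City"], "NYC", aliases=["New York City"]))
```

Comparing against a set of normalized aliases keeps the "exact" part of the metric strict while letting surface variants of the same entity count as correct.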