Can ChatGPT Replace Traditional KBQA Models? An In-Depth Analysis of the Question Answering Performance of the GPT LLM Family

📝 Paper Summary

Knowledge-Based Question Answering (KBQA)
Large Language Model Evaluation

This paper evaluates the GPT family's ability to replace traditional KBQA models using a comprehensive black-box testing framework covering ~190,000 complex questions across 8 datasets.

Core Problem

Existing evaluations of ChatGPT on Knowledge-Based Question Answering (KBQA) are limited in scale and scope, so it remains unclear whether LLMs can replace traditional models that query structured knowledge bases.

Why it matters:

LLMs generate free-text answers rather than exact entities, making traditional Exact Match (EM) metrics unreliable without adaptation
Current benchmarks lack large-scale testing of complex reasoning types (e.g., set operations, filtering) to identify specific LLM limitations
It is unknown whether LLMs' internal knowledge can supersede the need for external structured Knowledge Bases (KBs) in complex QA

Concrete Example: A traditional KBQA model queries a knowledge base and returns an entity, such as a person's name. ChatGPT instead generates a sentence like 'The person is [Name].', which standard Exact Match evaluation fails to match against the gold entity. In addition, ChatGPT may answer a question correctly yet fail when the same question is slightly paraphrased or contains a typo.
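The mismatch in the concrete example can be reproduced in a few lines. This is only an illustrative sketch: the entity name and the two helper functions are hypothetical and are not taken from the paper.

```python
def strict_exact_match(prediction: str, gold: str) -> bool:
    """Classic Exact Match: the normalized strings must be identical."""
    return prediction.strip().lower() == gold.strip().lower()


def contains_gold(prediction: str, gold: str) -> bool:
    """Relaxed check: the gold entity appears somewhere in the free-text answer."""
    return gold.strip().lower() in prediction.strip().lower()


# Hypothetical example values, chosen only for illustration.
kb_answer = "Barack Obama"                  # entity a KB query would return
llm_answer = "The person is Barack Obama."  # sentence an LLM might generate

strict_result = strict_exact_match(llm_answer, kb_answer)   # False
relaxed_result = contains_gold(llm_answer, kb_answer)       # True
```

Plain substring containment is shown only to make the failure of strict matching visible; it is not the extended matching strategy the paper uses.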

Key Novelty

Feature-Driven Black-Box KBQA Evaluation Framework

Treats LLMs as 'knowledge bases' and evaluates them using an extended Exact Match metric that parses constituent trees to find candidate answers
Applies software engineering testing principles (CheckList) to KBQA: Minimal Functionality Tests (basic ability), Invariance Tests (robustness to typos/paraphrasing), and Directional Expectation Tests (controllability via prompts)

Architecture

Overview of the Evaluation Framework

Evaluation Highlights

GPT-4 achieves 90.45% accuracy on WebQuestionSP (WQSP), outperforming the state-of-the-art traditional model (73.10%)
On the newer GrailQA dataset, GPT-4 (51.40%) still lags behind the traditional SOTA model (76.31%), showing LLMs struggle with the latest complex benchmarks
GPT-4 demonstrates high stability (91.70%) in invariance tests, approaching the perfect stability (100%) of traditional models

Breakthrough Assessment

7/10

Comprehensive evaluation framework that adapts traditional QA metrics to LLMs. Provides valuable insights into the 'LLM as KB' hypothesis, though it doesn't propose a new model architecture.

⚙️ Technical Details

Problem Definition

Setting: KB-based Complex Question Answering (KB-based CQA) where the model answers natural language questions using its internal knowledge

Inputs: Complex natural language question q (potentially with prompts)

Outputs: Free-text answer a which is parsed to extract specific entities or values

Pipeline Flow

Feature-driven Labeling (Tagging questions with Answer/Reasoning/Language types)
LLM Inference (Generating answers via API)
Extended Answer Matching (Parsing output → Candidate Extraction → Fuzzy Matching)
CheckList Testing (MFT, INV, DIR analysis)

System Modules

Feature Labeler

Assign standardized tags to questions (e.g., SetOperation, Boolean, PER) for granular analysis

Model or implementation: bert-base-NER / Keyword Matching
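The keyword-matching half of the labeler could look roughly like the sketch below. The tag names SetOperation, Counting, and Boolean come from this summary; the keyword patterns and the fallback tag "Unlabeled" are assumptions made for illustration, since the actual keyword lists are not given here.

```python
import re

# Hypothetical keyword rules; the real lists used by the authors are not shown in this summary.
REASONING_KEYWORDS = {
    "SetOperation": [r"\bboth\b", r"\beither\b", r"\bexcept\b"],
    "Counting": [r"\bhow many\b", r"\bnumber of\b"],
    "Boolean": [r"^(is|are|was|were|does|do|did)\b"],
}


def label_reasoning_types(question: str) -> list[str]:
    """Return every reasoning tag whose keyword patterns match the question."""
    q = question.strip().lower()
    tags = [
        tag
        for tag, patterns in REASONING_KEYWORDS.items()
        if any(re.search(pattern, q) for pattern in patterns)
    ]
    # "Unlabeled" is a placeholder tag introduced for this sketch.
    return tags or ["Unlabeled"]
```

A question can receive several tags at once, which matches the idea of analysing results per feature rather than per dataset. Answer-type tags such as PER would come from the NER model rather than from keyword rules.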

LLM Inference

Generate natural language answers based on internal knowledge

Model or implementation: GPT-3 / GPT-3.5 / ChatGPT / GPT-4 / FLAN-T5

Answer Evaluator

Extract and verify answers from free text against ground truth

Model or implementation: Constituent Parser + m-bert (multilingual BERT)

Novel Architectural Elements

Extended Exact Match strategy: Uses constituent tree parsing to extract candidate phrases and m-bert cosine similarity for fuzzy matching against alias lists
Unified feature labeling schema combining answer types (PER, LOC, etc.) and reasoning types (SetOperation, Counting, etc.) across heterogeneous datasets
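The extended matching idea can be sketched as follows. Two stand-ins are used so that the example runs without external models, and both are assumptions rather than the authors' method: word n-grams replace the constituent-tree candidates, and a character-trigram cosine replaces m-bert embeddings. The 0.78 threshold is the value quoted in this summary.

```python
import math
import re
from collections import Counter

THRESHOLD = 0.78  # fuzzy-matching threshold quoted in this summary


def candidate_phrases(answer: str, max_len: int = 4) -> list[str]:
    """Stand-in for constituent parsing: every word n-gram up to max_len words."""
    words = re.findall(r"\w+", answer.lower())
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def trigram_vector(text: str) -> Counter:
    """Stand-in for an m-bert embedding: a bag of character trigrams."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extended_match(answer: str, aliases: list[str], threshold: float = THRESHOLD) -> bool:
    """True if any candidate phrase equals, or is similar enough to, any gold alias."""
    candidates = candidate_phrases(answer)
    for alias in aliases:
        alias_norm = " ".join(re.findall(r"\w+", alias.lower()))
        alias_vec = trigram_vector(alias_norm)
        for cand in candidates:
            if cand == alias_norm or cosine(trigram_vector(cand), alias_vec) >= threshold:
                return True
    return False
```

Swapping `trigram_vector` for a real sentence encoder would change the similarity scores, so the 0.78 threshold should not be assumed to transfer to this stand-in unchanged.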

Modeling

Base Model: GPT-4 (and predecessors GPT-3, GPT-3.5 variants, ChatGPT) vs. FLAN-T5-XXL

Comparison to Prior Work

vs. Traditional KBQA: LLMs use internal parameters as the knowledge source (Unsupervised/Zero-shot) rather than querying an external structured KB (Supervised)
vs. HELM [not cited in paper]: Focuses specifically on complex reasoning types in KBQA (Set operations, multi-hop) rather than broad NLP tasks

Limitations

Evaluation relies on the 'LLM as Knowledge Base' assumption; hallucinated facts are counted as wrong but not explicitly distinguished from reasoning errors
Fuzzy matching threshold (0.78) is empirically determined and may produce false positives/negatives
Limited to questions where the answer exists in the model's training data (temporal cut-off issues not deeply explored)

Reproducibility

Code: https://github.com/tan92hl/Complex-Question-Answering-Evaluation-of-GPT-family.git

📊 Experiments & Results

Evaluation Setup

Zero-shot Question Answering on 8 KBQA datasets

Benchmarks:

KQApro (Complex KBQA)
LC-quad2.0 (Complex KBQA)
WebQuestionSP (WQSP) (KBQA)
ComplexWebQuestions (CWQ) (Complex KBQA)
GrailQA (Complex KBQA, generalization)
GraphQ (KBQA)
QALD-9 (Multilingual KBQA)
MKQA (Multilingual KBQA)

Metrics:

Accuracy (Exact Match)
F1 score (for GraphQ, QALD-9, LC-quad2.0)
Statistical methodology: Not explicitly reported in the paper
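For the datasets scored with F1, a common convention in KBQA is to compare the set of predicted answers with the set of gold answers. The sketch below assumes that set-based definition; the summary does not state the exact variant the paper uses, and the handling of empty sets is a choice made here.

```python
def set_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1 between predicted and gold answers for one question."""
    if not predicted and not gold:
        return 1.0  # assumption: two empty answer sets count as a perfect match
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# One of two predictions is correct and one of two gold answers is found.
example_score = set_f1({"a", "b"}, {"b", "c"})  # 0.5
```

A dataset-level score would then typically be the mean of the per-question values.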

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GPT-4 outperforms traditional SOTA on older/smaller datasets but lags on newer, more complex ones.
WebQuestionSP (WQSP)	Accuracy	73.10	90.45	+17.35
LC-quad2.0	F1	33.10	54.95	+21.85
GrailQA	Accuracy	76.31	51.40	-24.91
KQApro	Accuracy	93.85	57.20	-36.65
Chain-of-Thought (CoT) prompting significantly helps numerical reasoning but has mixed effects on other types.
Prompting Effect	Exact Match	Not reported in the paper	Not reported in the paper	+20.00
Invariance tests show GPT-4 approaching traditional model stability.
Stability Rate	Stability %	76.76	91.70	+14.94

Main Takeaways

GPT models outperform traditional KBQA on older datasets (WQSP, LC-quad2.0) but lag significantly on newer, complex benchmarks (GrailQA, KQApro).
Model capability scales with generation: GPT-4 > ChatGPT > GPT-3.5 > GPT-3, with consistent performance 'shapes' across datasets, suggesting architectural commonalities.
Chain-of-Thought (CoT) prompting provides massive gains for numerical questions (+20-35%) but minimal or negative impact for multi-hop and star-shaped reasoning.
LLMs are generally robust to typos and paraphrasing (high invariance), with GPT-4 reaching 91.70% stability, higher than earlier versions.
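One way to turn invariance-test outcomes into a stability percentage is sketched below. It assumes that a question counts as stable when the model is consistently right or consistently wrong across the original and all perturbed variants; the summary does not spell out the paper's exact definition, so treat this as one plausible reading.

```python
def stability_rate(outcomes_per_question: list[list[bool]]) -> float:
    """Percentage of questions whose correctness is identical across all variants.

    Each inner list holds one boolean per variant of a question
    (original, typo version, paraphrase, ...): True if answered correctly.
    """
    if not outcomes_per_question:
        raise ValueError("at least one question is required")
    stable = sum(1 for outcomes in outcomes_per_question if len(set(outcomes)) == 1)
    return 100.0 * stable / len(outcomes_per_question)


# Hypothetical outcomes for four questions with three variants each.
example_outcomes = [
    [True, True, True],     # stable (always correct)
    [False, False, False],  # stable (always wrong)
    [True, False, True],    # unstable
    [True, True, True],     # stable
]
example_stability = stability_rate(example_outcomes)  # 75.0
```

Under this reading, stability is separate from accuracy: a model that is wrong on every variant of a question is still counted as stable for it.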

📚 Prerequisite Knowledge

Prerequisites

Understanding of Knowledge-Based Question Answering (KBQA) vs. Text QA
Familiarity with evaluation metrics like Exact Match and F1
Basic knowledge of Black-box testing concepts (CheckList)

Key Terms

KBQA: Knowledge-Based Question Answering—systems that answer questions by querying a structured database (Knowledge Base)

CheckList: A behavioral testing framework for NLP models that checks capabilities (MFT), robustness (INV), and controllability (DIR)

Exact Match (EM): A metric that counts a prediction as correct only if it strictly matches the ground truth; adapted here to handle fuzzy matches and alias lists

MFT: Minimal Functionality Test—checks if the model can solve simple, specific reasoning tasks (e.g., only set operations)

INV: Invariance Test—checks if the model's answer remains consistent despite irrelevant changes to the input (e.g., typos, paraphrasing)

DIR: Directional Expectation Test—checks if the model's output changes in expected ways when the input is modified (e.g., adding constraints)

CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps

SPARQL: A query language for RDF data and knowledge graphs, often used by traditional KBQA models to retrieve answers from a knowledge base; LLMs generate text instead

Constituent Tree: A grammatical representation of a sentence used here to extract Noun Phrases (NP) or Verb Phrases (VP) as candidate answers