Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models

📝 Paper Summary

Semantic Parsing, Knowledge Base Question Answering (KBQA), Agentic LLM Frameworks

Interactive-KBQA treats the LLM as an agent that iteratively interacts with a Knowledge Base using three generic tools to generate SPARQL queries, enabling high performance with minimal annotated examples.

Core Problem

Semantic parsing-based KBQA methods typically require expensive, large-scale data annotation and struggle with complex queries involving constraints or multi-hop reasoning.

Why it matters:

Traditional semantic parsing is resource-intensive and lacks transparency ('black box' reasoning).
Current LLM-based methods underutilize the model's reasoning capabilities, often limiting them to simple classification or draft generation.
Complex queries (e.g., numerical constraints, aggregations) remain difficult for Information Retrieval-based approaches.

Concrete Example: For the question 'How many basketball players are taller than 2 meters?', standard IR methods fail because they rely on simple entity recognition. An LLM might hallucinate predicates without verifying against the KB schema. Interactive-KBQA breaks this down: search for 'basketball players', filter by 'height > 2m', and count the results.
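As a rough illustration of where that decomposition ends up, the sketch below assembles a counting query with a numeric filter. The class and predicate IRIs are hypothetical placeholders; in the interactive setting they would be discovered from the KB through tool calls rather than guessed.

```python
def build_count_query(class_iri: str, height_iri: str, min_height_m: float) -> str:
    """Build a SPARQL query that counts instances of a class above a height threshold."""
    return (
        "SELECT (COUNT(DISTINCT ?player) AS ?count) WHERE {\n"
        f"  ?player a <{class_iri}> .\n"          # step 1: restrict to the class
        f"  ?player <{height_iri}> ?height .\n"   # step 2: fetch the height literal
        f"  FILTER(?height > {min_height_m})\n"   # step 2: apply the numeric constraint
        "}"                                        # step 3: COUNT aggregates the matches
    )


query = build_count_query(
    "http://example.org/BasketballPlayer",  # hypothetical class IRI
    "http://example.org/heightInMeters",    # hypothetical predicate IRI
    2.0,
)
print(query)
```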

Key Novelty

Agent-Environment Paradigm for KBQA

Conceptualizes the LLM as an agent and the Knowledge Base as an environment, interacting via a unified thought-action loop.
Introduces three generic atomic tools (SearchNodes, SearchGraphPatterns, ExecuteSPARQL) adaptable to heterogeneous databases (Freebase, Wikidata, Movie KB).
Implements a human-in-the-loop annotation process where humans correct intermediate reasoning steps, creating high-quality low-resource training data efficiently.
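To make the three tools named above more concrete, here is a minimal sketch over a toy in-memory triple list. The function signatures, the toy identifiers, and the substring-based ranking are all assumptions for illustration; `match` is only a single-triple-pattern stand-in for ExecuteSPARQL, not a SPARQL engine.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy KB; every identifier below is made up for illustration.
TRIPLES: List[Triple] = [
    ("ex:p1", "rdfs:label", "Alice Tall"),
    ("ex:p1", "ex:profession", "ex:basketball_player"),
    ("ex:p1", "ex:height_m", "2.06"),
    ("ex:p2", "rdfs:label", "Bob Short"),
    ("ex:p2", "ex:profession", "ex:basketball_player"),
    ("ex:p2", "ex:height_m", "1.91"),
    ("ex:basketball_player", "rdfs:label", "basketball player"),
]


def search_nodes(name: str) -> List[str]:
    """SearchNodes stand-in: nodes whose label contains `name` (case-insensitive)."""
    needle = name.lower()
    return [s for s, p, o in TRIPLES if p == "rdfs:label" and needle in o.lower()]


def search_graph_patterns(node: str, hint: str = "") -> List[str]:
    """SearchGraphPatterns stand-in: one-hop predicates around `node`, hint matches first."""
    preds = sorted({p for s, p, o in TRIPLES if s == node or o == node})
    h = hint.lower()
    # Stable sort: predicates containing the hint sort before the rest.
    return sorted(preds, key=lambda p: h not in p.lower())


def match(s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """ExecuteSPARQL stand-in: match one triple pattern; None acts as a wildcard."""
    return [t for t in TRIPLES
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


nodes = search_nodes("basketball")
predicates = search_graph_patterns("ex:p1", "height")
players = [s for s, _, _ in match(p="ex:profession", o=nodes[0])]
tall = [s for s in players if float(match(s=s, p="ex:height_m")[0][2]) > 2.0]
print(nodes, predicates, len(tall))
```

The point of the sketch is the division of labour: one call grounds a surface name to a node, one exposes the schema around a node, and one runs the query built from what was observed.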

Architecture

The overall framework of Interactive-KBQA. It depicts the iterative loop between the LLM (Agent) and the Knowledge Base (Environment).

Evaluation Highlights

Open-source LLMs fine-tuned on only ~50 annotated examples per question type outperform GPT-4 Turbo on ComplexWebQuestions (CWQ) and KQA Pro.
Achieves large gains on specific complex question types in CWQ: +29.85 F1 points on Comparative and +13.96 F1 points on Superlative questions relative to baselines.
Demonstrates high efficiency in low-resource settings, rivaling or beating full-data semantic parsing baselines (trained on 3K-33K examples) with minimal data.

Breakthrough Assessment

8/10

Strong contribution in applying agentic workflows to semantic parsing. The human-machine collaborative annotation strategy offers a practical solution to the data scarcity problem in KBQA.

⚙️ Technical Details

Problem Definition

Setting: Semantic parsing over a structured Knowledge Base (KB) K containing entities E, relations R, classes C, and literals L.

Inputs: Natural language question Q and Knowledge Base K

Outputs: Executable SPARQL expression S that retrieves the correct answer

Pipeline Flow

Question Type Classification (selects exemplars)
Prompt Construction (Instruction + Exemplars + Question)
Iterative Interaction Loop (Thought → Action → Observation)
Final Answer Generation
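The interaction loop in the flow above can be sketched as follows. This is a minimal sketch under stated assumptions: the `Action: Tool("arg")` step format, the `Done` terminating action, and the turn budget are hypothetical conventions chosen here, and the LLM is replaced by a scripted stand-in so the example runs without a model.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Assumed step format: free-text thought, then 'Action: ToolName("argument")'.
ACTION_RE = re.compile(r"Action:\s*(\w+)\((.*)\)\s*$", re.S)


def run_episode(
    question: str,
    llm: Callable[[List[str]], str],
    tools: Dict[str, Callable[[str], object]],
    max_turns: int = 8,
) -> Tuple[Optional[str], List[str]]:
    """Alternate LLM steps and tool observations until Done or the turn budget ends."""
    history = [f"Question: {question}"]
    for _ in range(max_turns):
        step = llm(history)            # Thought + Action, conditioned on the history
        history.append(step)
        m = ACTION_RE.search(step)
        if m is None:
            history.append("Observation: no parsable action")
            continue
        name, arg = m.group(1), m.group(2).strip().strip('"')
        if name == "Done":             # hypothetical terminating action
            return arg, history
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        history.append(f"Observation: {observation}")
    return None, history


# Scripted stand-in for the LLM.
_script = iter([
    'Thought: locate the entity first.\nAction: SearchNodes("basketball player")',
    'Thought: the node is found.\nAction: Done("ex:basketball_player")',
])
answer, history = run_episode(
    "Which node represents basketball players?",
    llm=lambda h: next(_script),
    tools={"SearchNodes": lambda q: ["ex:basketball_player"]},
)
print(answer)
```

Feeding unparsable steps and unknown tool names back as observations, rather than raising, lets the agent recover within the same episode.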

System Modules

Prompt Constructor

Combines instructions, tool definitions, and retrieval-augmented exemplars into a prompt

Model or implementation: N/A (Algorithmic)
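One possible shape for such a constructor is sketched below. The exemplar store, the section layout, and the parameter names are assumptions for illustration, not the paper's prompt template.

```python
from typing import Dict, List

# Hypothetical exemplar store keyed by predicted question type.
EXEMPLARS: Dict[str, List[str]] = {
    "comparative": ["<demonstration trace for a comparative question>"],
    "superlative": ["<demonstration trace for a superlative question>"],
}


def build_prompt(instruction: str, tool_docs: List[str], question: str,
                 question_type: str, k: int = 2) -> str:
    """Assemble instruction, tool descriptions, up to k exemplars, and the question."""
    exemplars = EXEMPLARS.get(question_type, [])[:k]
    sections = [instruction, "Tools:\n" + "\n".join(f"- {d}" for d in tool_docs)]
    sections += [f"Example {i}:\n{ex}" for i, ex in enumerate(exemplars, 1)]
    sections.append(f"Q: {question}")
    return "\n\n".join(sections)


prompt = build_prompt(
    "Answer the question by interacting with the knowledge base.",
    ["SearchNodes(name)", "SearchGraphPatterns(sparql, semantic)", "ExecuteSPARQL(sparql)"],
    "Is A taller than B?",
    question_type="comparative",
)
print(prompt)
```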

Reasoning Agent (Interaction Loop)

Generates thoughts and actions (tool calls) based on interaction history

Model or implementation: Fine-tuned LLM (e.g., Llama-2-7B, Mistral-7B) or Closed LLM (GPT-4)

KB Toolset (Interaction Loop)

Executes actions against the KB and returns observations

Model or implementation: SPARQL Engine / API

Novel Architectural Elements

Unified interaction logic using three atomic tools (SearchNodes, SearchGraphPatterns, ExecuteSPARQL) that abstract away database-specific complexities
Human-Machine Collaborative Annotation pipeline where the model's reasoning trace is corrected iteratively to create training data

Modeling

Base Model: Llama-2-7B-Chat, Llama-2-13B-Chat, Mistral-7B-Instruct-v0.2

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize negative log-likelihood of the target tokens (thoughts and actions).

Formally: Standard causal language modeling loss.
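
Written out under the usual SFT convention (an assumption, since the exact notation is not given here), let x be the prompt plus the interaction history so far, including observations, and let y_1, ..., y_T be the target thought and action tokens. Observations are treated as context only, consistent with the targets being thoughts and actions:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)
```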

Trainable Parameters: Full fine-tuning (implied by context of SFT on open-source models)

Training Data:

Human-annotated low-resource dataset: ~50 examples per question type randomly sampled from training sets.
Annotators correct the model's 'Action' if unreasonable, feeding the correction back to generate the next step.
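The correction loop described above might look like the sketch below. All names (`propose`, `review`, `env`, the `Done` action, the trace fields) are hypothetical; the model and the annotator are replaced by scripted stand-ins.

```python
from typing import Callable, Dict, List, Optional


def annotate(
    question: str,
    propose: Callable[[List[str]], str],
    review: Callable[[List[str], str], Optional[str]],
    env: Callable[[str], str],
    max_turns: int = 8,
) -> List[Dict[str, object]]:
    """Collect a trace in which each proposed action may be replaced by a human fix."""
    history: List[str] = [question]
    trace: List[Dict[str, object]] = []
    for _ in range(max_turns):
        proposed = propose(history)
        correction = review(history, proposed)   # None means "accept as is"
        action = proposed if correction is None else correction
        trace.append({"action": action, "corrected": correction is not None})
        if action == "Done":
            break
        # The corrected action, not the original proposal, conditions the next step.
        history += [action, env(action)]
    return trace


_proposals = iter(["SearchNodes(hoops)", "Done"])
trace = annotate(
    "How many basketball players are taller than 2 meters?",
    propose=lambda h: next(_proposals),
    review=lambda h, a: "SearchNodes(basketball player)" if a == "SearchNodes(hoops)" else None,
    env=lambda a: f"Observation for {a}",
)
print(trace)
```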

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeCAF/KB-BINDER: Interactive-KBQA uses a multi-turn agentic approach rather than single-pass generation or draft-refinement.
vs. StructGPT/ToG: Interactive-KBQA does not assume golden entities are provided; it performs entity linking as part of the interaction. It also handles specific complex structures (CVT, Qualifiers) explicitly.
vs. Pangu [not cited in paper]: Unlike Pangu, which focuses on discriminative reasoning, Interactive-KBQA focuses on generative semantic parsing via tool use.

Limitations

Dependency on the quality of the underlying LLM; smaller models (e.g., Llama-2-7B) struggle without fine-tuning.
Latency concerns due to multi-turn interactions (network overhead for multiple SPARQL queries).
Entity linking errors remain a primary failure mode (e.g., finding 'Young Hollywood Awards' instead of the specific award category).
In some cases (e.g., Conjunction questions), the step-by-step predicate identification is less efficient than methods that exploit redundancy.

Reproducibility

Code: https://github.com/JimXiongGM/Interactive-KBQA

The repository hosts both the code and the data, including the released human-annotated dataset with step-wise reasoning processes.

📊 Experiments & Results

Evaluation Setup

KBQA on heterogeneous databases (Freebase, Wikidata, Movie KB)

Benchmarks:

WebQuestionsSP (WebQSP): KBQA on Freebase (1-hop, 2-hop)
ComplexWebQuestions (CWQ): Complex KBQA on Freebase (Conjunction, Composition, Comparative, Superlative)
KQA Pro: Large-scale complex KBQA on Wikidata (9 question types)
MetaQA: Multi-hop KBQA on Movie KB

Metrics:

F1 score
Exact Match (EM)
Random Hits@1
Statistical methodology: Not explicitly reported in the paper
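
For reference, the answer-level metrics listed above are commonly computed over predicted and gold answer sets as in this sketch. The empty-set conventions and the definition of Random Hits@1 as "one uniformly sampled predicted answer is correct" are assumptions here; the paper's evaluation scripts may differ in detail.

```python
import random
from typing import Set


def f1_score(pred: Set[str], gold: Set[str]) -> float:
    """Set-level F1; two empty sets count as a perfect match (a convention choice)."""
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def exact_match(pred: Set[str], gold: Set[str]) -> bool:
    """True only when the predicted set equals the gold set."""
    return pred == gold


def random_hits_at_1(pred: Set[str], gold: Set[str], rng: random.Random) -> float:
    """1.0 if one uniformly sampled predicted answer is in the gold set, else 0.0."""
    if not pred:
        return 0.0
    return float(rng.choice(sorted(pred)) in gold)


score = f1_score({"a", "b"}, {"a"})
print(round(score, 4), exact_match({"a"}, {"a"}))
```

Averaged over many samples, Random Hits@1 for a single question approaches the fraction of predicted answers that are correct.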

Key Results

Comparative performance on standard benchmarks. Note that for WebQSP and KQA Pro, the proposed method uses significantly less training data (~50/type) compared to baselines using full datasets (~3K-33K), yet remains competitive.
Benchmark	Metric	Baseline	This Paper	Δ
ComplexWebQuestions (CWQ)	F1	61.1	69.1	+8.0
MetaQA (3-hop)	F1	87.0	90.7	+3.7
WebQuestionsSP (WebQSP)	F1	75.7	73.9	-1.8
KQA Pro	Accuracy	90.55	75.35	-15.2

Detailed breakdown by question type on CWQ showing specific strengths in reasoning-heavy categories.
Benchmark	Metric	Baseline	This Paper	Δ
ComplexWebQuestions (Comparative)	F1	39.60	69.45	+29.85
ComplexWebQuestions (Superlative)	F1	54.12	68.08	+13.96

Main Takeaways

Interactive-KBQA demonstrates that agentic interaction with tools can substantially reduce the need for massive labeled datasets in semantic parsing.
Fine-tuning open-source models (Mistral-7B) on a small set of high-quality, human-corrected interaction traces can outperform closed-source models (GPT-4) and full-data baselines on complex query types.
The method excels at 'reasoning-heavy' questions (comparatives, superlatives) where traditional one-shot semantic parsing struggles to capture the logic.
The framework applies the same interaction logic across different KB structures (Freebase, Wikidata, Movie KB), which suggests robustness across schemas.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Base Question Answering (KBQA)
SPARQL query language
In-context learning / Few-shot prompting
Agentic LLM patterns (Thought-Action-Observation)

Key Terms

SPARQL: A semantic query language for databases, used to retrieve and manipulate data stored in Resource Description Framework (RDF) format

CVT: Compound Value Type—a node structure in Freebase used to represent complex data where an entry consists of multiple fields (e.g., a movie role linking an actor and a character name)

KBQA: Knowledge Base Question Answering—systems that answer natural language questions using structured data from knowledge bases

Entity Linking (EL): The process of identifying and disambiguating entities mentioned in text to their corresponding unique entries in a knowledge base

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific dataset to adapt it to a particular task

CoT: Chain-of-Thought—a prompting technique that encourages the model to generate intermediate reasoning steps before the final answer

MCR: Mention Cover Rate—a metric quantifying the difficulty of Entity Linking, defined as the rate at which golden entity names appear directly in the questions
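
A minimal sketch of how MCR could be computed from the definition above. Counting per golden entity (rather than per question) and using case-insensitive substring matching are assumptions; the sample questions are made up.

```python
from typing import List, Tuple


def mention_cover_rate(examples: List[Tuple[str, List[str]]]) -> float:
    """Fraction of golden entity names that appear verbatim in their question."""
    total = 0
    covered = 0
    for question, golden_names in examples:
        q = question.lower()
        for name in golden_names:
            total += 1
            covered += name.lower() in q   # bool adds as 0 or 1
    return covered / total if total else 0.0


examples = [
    ("Who directed Inception?", ["Inception"]),              # name appears in the question
    ("Which team did MJ play for?", ["Michael Jordan"]),     # alias only, not covered
]
mcr = mention_cover_rate(examples)
print(mcr)
```

A lower MCR means more questions refer to entities by aliases or paraphrases, which makes entity linking harder.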