Probing Structured Semantics Understanding and Generation of Language Models via Question Answering

📝 Paper Summary

Knowledge Base Question Answering (KBQA), Semantic Parsing, LLM Evaluation

LLMs are highly proficient at understanding formal languages (translating them to natural language) but struggle significantly to generate them correctly, with performance heavily dependent on how similar the formal syntax is to natural language.

Core Problem

While LLMs are used for semantic parsing in KBQA, there is little understanding of their varying proficiency across different formal languages (KoPL, SPARQL, Lambda DCS) or their inherent ability to understand vs. generate these structures.

Why it matters:

Current approaches blindly apply LLMs to semantic parsing without knowing which formal languages leverage LLM strengths best
Understanding the 'generation gap' between understanding and producing code is crucial for designing reliable neuro-symbolic systems
Selecting the wrong formal language target (e.g., one highly dissimilar to natural language) can bottleneck KBQA performance regardless of model size

Concrete Example: When asked to generate a Lambda DCS logical form for 'Which cost less? Batman Begins released in Italy or Tootsie', models such as GPT-3.5 fail to produce valid code even with few-shot examples, achieving under 5% accuracy, whereas they handle the reverse direction (code to text) with ease.

Key Novelty

Bidirectional Proficiency Probing for KBQA Formalisms

Evaluates LLMs on both 'Understanding' (Code-to-Text) and 'Generation' (Text-to-Code) across three distinct formal languages (KoPL, SPARQL, Lambda DCS) to reveal asymmetry in capabilities
Proposes a 'Contrastive Evaluation' for understanding: instead of unreliable text metrics (BLEU), train a small parser on LLM-generated synthetic data and measure its KBQA accuracy against a parser trained on human data
Introduces a structure-preserving retrieval strategy for In-Context Learning that selects demonstrations based on tree-edit distance of logical skeletons rather than just semantic similarity

Architecture

Illustration of the bidirectional probing tasks: Formal Language Understanding (translating LF to NLQ) and Formal Language Generation (translating NLQ to LF), showing inputs and retrieved exemplars.

Evaluation Highlights

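Understanding Gap: A parser trained on questions generated by Text-Davinci-003 reaches 88.1% accuracy on KoPL (close to the 90.6% obtained with human-written questions), but the same model reaches only 41.6% accuracy when generating KoPL itself.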
Language Sensitivity: Models perform best on KoPL (most NL-like), achieving 41.6% generation accuracy, vs. 22.5% for SPARQL and 10.0% for Lambda DCS.
Entity Linking Impact: Adding entity/relation candidates to the prompt boosts Text-Davinci-003's SPARQL generation from ~2% to 22.5%.

Breakthrough Assessment

7/10

Provides a valuable, rigorous diagnostic analysis of LLM limitations in semantic parsing. The finding that 'understanding >> generation' and the ranking of formal languages by NL-similarity are important practical insights for KBQA system design.

⚙️ Technical Details

Problem Definition

Setting: Probing LLMs on bidirectional translation between Natural Language Questions (NLQ) and Logical Forms (LF) using In-Context Learning without fine-tuning.

Inputs: For Understanding: Logical Form l*. For Generation: Natural Language Question q*.

Outputs: For Understanding: Natural Language Question q. For Generation: Logical Form l.

Pipeline Flow

Input (LF or NLQ)
Demonstration Retrieval (Structure-aware for LF, BM25 for NLQ)
Prompt Construction (Instruction + Examples + Input)
LLM Inference (Generate NLQ or LF)
Evaluation (Parser Training for Understanding, Execution/Match for Generation)
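
The prompt construction step can be sketched roughly as follows. This is a minimal illustration, not the paper's actual template: the function name `build_prompt`, the field labels, the instruction wording, and the example logical forms are all assumptions made for this sketch.

```python
from typing import List, Tuple

def build_prompt(instruction: str,
                 demos: List[Tuple[str, str]],
                 query: str,
                 src_label: str = "Logical form",
                 tgt_label: str = "Question") -> str:
    """Concatenate instruction, demonstrations, and the test input (hypothetical layout)."""
    parts = [instruction.strip(), ""]
    for src, tgt in demos:
        parts.append(f"{src_label}: {src}")
        parts.append(f"{tgt_label}: {tgt}")
        parts.append("")  # blank line between demonstrations
    # The test input comes last, with the target field left open for the LLM.
    parts.append(f"{src_label}: {query}")
    parts.append(f"{tgt_label}:")
    return "\n".join(parts)

# Illustrative strings only; these are not taken from KQA Pro.
demos = [("Find(France).Relate(capital).What()", "What is the capital of France?")]
prompt = build_prompt("Translate the logical form into a question.",
                      demos,
                      "Find(Italy).Relate(capital).What()")
print(prompt)
```

For the Generation direction, the same helper could be called with the labels swapped (`src_label="Question"`, `tgt_label="Logical form"`).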

System Modules

Demonstration Retriever (Understanding) (Retrieval)

Select k examples with logical skeletons most similar to the target input to help the LLM parse structure

Model or implementation: Tree Edit Distance / Greedy Search
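
A minimal sketch of skeleton-based retrieval is given below, under several assumptions. Logical forms are represented as nested `(function, arguments, children)` tuples, the function names are illustrative, and a plain sort by distance stands in for the greedy search mentioned above, whose details are not given in this summary. The distance is the textbook unit-cost edit distance on ordered trees, computed with a memoized forest recursion.

```python
from functools import lru_cache

def node(func, args=(), children=()):
    """Build a hashable tree node: (function name, arguments, child nodes)."""
    return (func, tuple(args), tuple(children))

def skeleton(tree):
    """Drop entity/relation arguments, keeping only the function structure."""
    func, _args, children = tree
    return (func, tuple(skeleton(c) for c in children))

def _size(forest):
    return sum(1 + _size(children) for _label, children in forest)

@lru_cache(maxsize=None)
def _forest_distance(f1, f2):
    # Unit-cost insert / delete / relabel on ordered forests.
    if not f1:
        return _size(f2)
    if not f2:
        return _size(f1)
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    delete = _forest_distance(f1[:-1] + c1, f2) + 1
    insert = _forest_distance(f1, f2[:-1] + c2) + 1
    relabel = (_forest_distance(f1[:-1], f2[:-1])
               + _forest_distance(c1, c2)
               + (l1 != l2))
    return min(delete, insert, relabel)

def tree_edit_distance(t1, t2):
    return _forest_distance((t1,), (t2,))

def retrieve(target, pool, k=3):
    """Return the k pool examples whose skeletons are closest to the target's."""
    target_skel = skeleton(target)
    ranked = sorted(pool, key=lambda ex: tree_edit_distance(target_skel, skeleton(ex["lf"])))
    return ranked[:k]

# Hypothetical pool; function names and questions are illustrative only.
pool = [
    {"nlq": "How many cities are there?",
     "lf": node("Count", children=[node("Filter", ["city"], [node("Find", ["all"])])])},
    {"nlq": "What is the capital of France?",
     "lf": node("What", children=[node("Relate", ["capital"], [node("Find", ["France"])])])},
]
target = node("What", children=[node("Relate", ["director"], [node("Find", ["Batman Begins"])])])
best = retrieve(target, pool, k=1)[0]
```

Because arguments are stripped before comparison, the target matches the second pool entry with distance 0 even though the entities and relations differ.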

Demonstration Retriever (Generation) (Retrieval)

Select k examples similar to the target question

Model or implementation: BM25 (with entity masking)
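
The idea of BM25 retrieval with entity masking can be sketched as below. This is a self-contained approximation, not the authors' code: the `[ENT]` placeholder, the helper names, the assumption that entity mentions are supplied as a list, and the parameter values `k1=1.5`, `b=0.75` are all choices made for this illustration.

```python
import math
import re
from collections import Counter

def mask_entities(question, entities, mask="[ENT]"):
    """Replace known entity mentions so ranking reflects question structure."""
    for ent in sorted(entities, key=len, reverse=True):  # longest mentions first
        question = re.sub(re.escape(ent), mask, question, flags=re.IGNORECASE)
    return question

def tokenize(text):
    # Keep the (lowercased) mask as a single token.
    return re.findall(r"\[ent\]|\w+", text.lower())

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by Okapi BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Hypothetical, already-masked demonstration questions.
pool = [
    "Which cost less, [ENT] or [ENT]?",
    "Who directed [ENT]?",
    "How many cities have a population above 1000000?",
]
query = mask_entities("Which cost less? Batman Begins or Tootsie",
                      ["Batman Begins", "Tootsie"])
ranking = bm25_rank(query, pool)
```

Masking keeps surface entity names from dominating the lexical match, so questions with the same comparative structure rank first.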

LLM Prober

Generate the target translation using ICL

Model or implementation: Various (GPT-3.5, Llama-2, FLAN-T5, etc.)

Evaluation Parser

Assess quality of LLM-generated NLQs by training a parser on them and measuring its accuracy

Model or implementation: BART-base or Bi-LSTM
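
The evaluation idea can be illustrated with a toy sketch: train one parser on human-written questions and another on LLM-generated questions, score both on the same test set, and compare. Everything below is an assumption for illustration. In particular, `train_lookup_parser` is a trivial memorizing stand-in for the BART or Bi-LSTM parser, and the data is invented.

```python
def train_lookup_parser(pairs):
    """Toy stand-in for a trained parser: memorizes question -> logical form."""
    table = {q.lower(): lf for q, lf in pairs}
    return lambda question: table.get(question.lower())

def contrastive_eval(train_parser, human_train, llm_train, test_set):
    """Compare parsers trained on human vs. LLM-generated questions."""
    def accuracy(parser):
        return sum(parser(q) == lf for q, lf in test_set) / len(test_set)
    acc_human = accuracy(train_parser(human_train))
    acc_llm = accuracy(train_parser(llm_train))
    return {"human": acc_human, "llm": acc_llm, "gap": acc_llm - acc_human}

# Invented data: the logical forms are shared, only the questions differ.
human_train = [("Who directed Tootsie?", "LF1"),
               ("What is the capital of France?", "LF2")]
llm_train = [("Who directed Tootsie?", "LF1"),
             ("Which city is France's capital?", "LF2")]
test_set = [("who directed tootsie?", "LF1"),
            ("what is the capital of france?", "LF2")]

result = contrastive_eval(train_lookup_parser, human_train, llm_train, test_set)
```

A small gap suggests the LLM-generated questions preserve the meaning of the logical forms well enough to substitute for human annotations.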

Novel Architectural Elements

Contrastive Evaluation Framework: Evaluating 'Understanding' not by text similarity but by the utility of generated data for training a downstream parser
Skeleton-based Demonstration Retrieval: Selecting ICL examples by minimizing Tree Edit Distance between logical form skeletons rather than semantic similarity

Modeling

Base Model: Evaluated multiple: GPT-2 (Large/XL), GPT-J (6B), FLAN-T5 (L/XL/XXL), Llama-2 (7B/13B/70B), GLM-130B, Text-Davinci-001/003

Training Method: In-Context Learning (Inference Only) for LLMs; Supervised Fine-Tuning for Evaluation Parsers

Training Data:

KQA Pro (KoPL)
Overnight (Lambda DCS)
GrailQA (SPARQL)

Key Hyperparameters:

demonstration_count_k: 3 (Understanding); as many as fit in the context window (Generation)
parser_learning_rate: Not reported in the paper
parser_batch_size: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard Semantic Parsing: This work evaluates zero/few-shot ICL capabilities of general LLMs rather than fully supervised specialized parsers.
vs. Previous KBQG: Uses LLMs with structure-aware ICL instead of training specific generators.

Limitations

Code-pretrained models (like Codex) were not systematically evaluated due to access issues, though Text-Davinci-003 includes code training.
Evaluation is limited to three formal languages and specific datasets.
Generation evaluation relies on strict execution/match, which might penalize valid but syntactically distinct variations (though canonical forms minimize this).
FLAN-T5 models showed anomalous degradation with scale, which was not fully explained beyond error analysis.

Reproducibility

Code: https://github.com/Matthewlliu/structure_probe

Datasets (KQA Pro, Overnight, GrailQA) are public. Model weights for GPT-3.5 are not available (API access only). Hyperparameters for the BART evaluation parsers are referenced in the appendix, but specific values such as learning rate and batch size are not given in the main text.

📊 Experiments & Results

Evaluation Setup

Few-shot In-Context Learning on KBQA datasets

Benchmarks:

KQA Pro: KoPL (complex reasoning)
GrailQA: SPARQL (graph patterns)
Overnight: Lambda DCS (compact logical forms)

Metrics:

Execution Accuracy (for Generation)
Parser Accuracy (proxy for Understanding quality)
Statistical methodology: Not explicitly reported in the paper
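
The two Generation-side metrics can be sketched as follows. This is a hedged illustration: the `execute` callable and the toy knowledge base are hypothetical, and the whitespace normalization used for exact match is an assumption rather than the paper's canonicalization procedure.

```python
def execution_accuracy(predictions, gold_answers, execute):
    """Fraction of predicted logical forms whose execution result equals the gold answer."""
    correct = 0
    for lf, gold in zip(predictions, gold_answers):
        try:
            answer = execute(lf)
        except Exception:
            continue  # invalid or failing logical forms count as wrong
        correct += answer == gold
    return correct / len(gold_answers)

def exact_match(predictions, gold_lfs):
    """Fraction of predictions identical to the gold form after whitespace normalization."""
    def normalize(s):
        return " ".join(s.split())
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold_lfs))
    return hits / len(gold_lfs)

# Toy executor over an invented knowledge base.
toy_kb = {"capital(France)": "Paris", "capital(Italy)": "Rome"}

def toy_execute(lf):
    return toy_kb[lf]  # raises KeyError for malformed logical forms

preds = ["capital(France)", "capital( Italy", "capital(Italy)"]
golds = ["Paris", "Rome", "Rome"]
acc = execution_accuracy(preds, golds, toy_execute)
em = exact_match(["Find(France)   .What()"], ["Find(France) .What()"])
```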

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Understanding Tasks: LLMs generate high-quality synthetic data that trains parsers almost as well as human-labeled data.
KQA Pro (KoPL)	Parser Accuracy	90.6	88.1	-2.5
GrailQA (SPARQL)	Parser F1	74.7	73.8	-0.9
Generation Tasks: LLMs struggle to generate correct logical forms from natural language.
KQA Pro (KoPL)	Accuracy	88.1	41.6	-46.5
GrailQA (SPARQL)	F1	73.8	22.5	-51.3
Overnight (Lambda DCS)	Accuracy	79.0	10.0	-69.0
KQA Pro (KoPL)	Accuracy	80.0	62.7	-17.3
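
The Δ column is simply "This Paper" minus "Baseline". The short check below recomputes it from the rows above; the group labels are taken from the table's two captions, and the variable names are illustrative.

```python
# (task group, benchmark, baseline, this paper), copied from the table above
rows = [
    ("Understanding", "KQA Pro (KoPL)", 90.6, 88.1),
    ("Understanding", "GrailQA (SPARQL)", 74.7, 73.8),
    ("Generation", "KQA Pro (KoPL)", 88.1, 41.6),
    ("Generation", "GrailQA (SPARQL)", 73.8, 22.5),
    ("Generation", "Overnight (Lambda DCS)", 79.0, 10.0),
    ("Generation", "KQA Pro (KoPL)", 80.0, 62.7),
]

# Round to one decimal to avoid floating-point noise in the differences.
deltas = [round(ours - base, 1) for _group, _bench, base, ours in rows]

for (group, bench, base, ours), delta in zip(rows, deltas):
    print(f"{group:13s} {bench:24s} {base:5.1f} -> {ours:5.1f}  ({delta:+.1f})")
```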

Experiment Figures

Impact of demonstration number and entity linking on Generation performance for Text-Davinci-003.

Main Takeaways

Asymmetry in Proficiency: LLMs are excellent at interpreting formal languages (translating to NL) but struggle to generate them, suggesting they are better 'readers' than 'writers' of logical forms.
Hierarchy of Formalisms: KoPL > SPARQL > Lambda DCS. The closer a formal language's structure and naming are to natural language (as with KoPL), the better LLMs perform.
Entity Linking is Critical: Providing entity/relation candidates in the prompt is essential for languages like SPARQL, boosting Text-Davinci-003's generation performance from ~2% to 22.5%.
Scaling Anomaly: FLAN-T5 performance degrades with size on these tasks, unlike GPT and Llama models which scale positively.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Base Question Answering (KBQA)
Semantic Parsing (translating text to logical forms)
In-Context Learning (few-shot prompting)
Basic graph/tree structures in formal languages

Key Terms

KoPL: Knowledge-oriented Programming Language, a Lisp-like, function-chain structured query language that explicitly models reasoning steps (e.g., Find -> Relate -> Filter)

Lambda DCS: Lambda Dependency-Based Compositional Semantics, a compact logical formalism focusing on variable-free composition, often represented as s-expressions

SPARQL: Standard query language for RDF databases, using triple patterns to match graph structures

Tree Edit Distance (TED): A metric calculating the minimum number of node insertions, deletions, and re-labelings required to transform one tree structure into another

Skeleton: The structural abstraction of a logical form obtained by removing specific entity/relation arguments, leaving only operators/functions (used for similarity search)

In-Context Learning (ICL): A prompting technique where the model is given a few input-output examples (demonstrations) in the prompt before the actual test input

Chain-of-Thought (CoT): A prompting strategy where the model generates intermediate reasoning steps (here, generating the skeleton first) before the final answer