Exploring Chain-of-Thought Style Prompting for Text-to-SQL

📝 Paper Summary

Text-to-SQL Parsing, In-Context Learning, Chain-of-Thought Prompting

QDecomp improves Text-to-SQL parsing by prompting LLMs to decompose complex questions into sub-questions and identify relevant columns in a single pass, avoiding the error propagation of detailed SQL-based reasoning.

Core Problem

Large Language Models struggle with Text-to-SQL parsing because standard prompting lacks reasoning, while existing Chain-of-Thought methods introduce error propagation through overly detailed steps or computationally expensive iteration.

Why it matters:

Text-to-SQL is critical for building natural language interfaces to databases, but training supervised models requires expensive expert annotation
Existing CoT methods like Least-to-Most prompting require multiple model calls (iterative), increasing latency and cost
Detailed reasoning steps (describing every SQL clause) often contain hallucinations that cascade into the final SQL query (error propagation)

Concrete Example: For the question 'How many United Airlines flights go to City Aberdeen?', a Least-to-Most prompter generates a sub-query for 'go to City' first. It might incorrectly guess a column name for 'City' in this intermediate step because it hasn't seen the full context 'Aberdeen' yet. This incorrect column name propagates to the final query, causing execution failure.

Key Novelty

Question Decomposition Prompting (QDecomp + InterCOL)

Instead of describing SQL execution steps (which is error-prone), prompt the model to break the natural language question into sub-questions in a single pass
Augment reasoning steps with 'InterCOL' (Intermediate Column detection), where the model explicitly lists the table and column names relevant to each sub-question before generating the final SQL
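To make the two ideas above concrete, here is a minimal sketch of how one few-shot exemplar with a QDecomp+InterCOL reasoning path might be rendered. The exact wording and layout used by the authors are not given in this summary, so the `Step` class, the `render_exemplar` helper, the "Decomposition:" / "Columns:" / "SQL:" labels, and every table and column name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    sub_question: str   # one piece of the decomposed question
    columns: List[str]  # InterCOL-style "table.column" mentions for this piece


def render_exemplar(question: str, steps: List[Step], sql: str) -> str:
    """Render one exemplar: question, numbered sub-questions with columns, final SQL."""
    lines = [f"Question: {question}", "Decomposition:"]
    for i, step in enumerate(steps, start=1):
        lines.append(f"{i}. {step.sub_question}")
        lines.append(f"   Columns: {', '.join(step.columns)}")
    lines.append(f"SQL: {sql}")
    return "\n".join(lines)


# Hypothetical schema names, chosen only for illustration.
exemplar = render_exemplar(
    question="How many United Airlines flights go to City Aberdeen?",
    steps=[
        Step("Which flights go to City Aberdeen?",
             ["flights.dest_airport", "airports.airport_code", "airports.city"]),
        Step("How many of those are United Airlines flights?",
             ["flights.airline", "airlines.uid", "airlines.airline"]),
    ],
    sql=("SELECT count(*) FROM flights AS f "
         "JOIN airports AS a ON f.dest_airport = a.airport_code "
         "JOIN airlines AS l ON f.airline = l.uid "
         "WHERE a.city = 'Aberdeen' AND l.airline = 'United Airlines'"),
)
print(exemplar)
```

The whole trace, including the final SQL, is a single block of text, so the model can be asked to produce it in one generation.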

Architecture

Comparison of four prompting strategies: Standard, Chain-of-Thought, Least-to-Most, and the proposed QDecomp / QDecomp+InterCOL

Evaluation Highlights

+5.2 points absolute gain in test-suite accuracy on Spider Dev over standard prompting (68.4% vs 63.2%)
+5.5 points absolute gain on Spider Realistic over standard prompting (56.5% vs 51.0%)
Outperforms Least-to-Most prompting by 2.4 points on Spider Dev (68.4% vs 66.0%) while using single-pass generation instead of iterative prompting

Breakthrough Assessment

8/10

Significantly improves Text-to-SQL performance without fine-tuning, challenging the prevailing assumption that iterative prompting is necessary for complex reasoning. The QDecomp+InterCOL design effectively isolates schema linking errors.

⚙️ Technical Details

Problem Definition

Setting: Cross-domain Text-to-SQL parsing using In-Context Learning (Few-Shot)

Inputs: Natural language question q, Database Schema S (tables and columns)

Outputs: Executable SQL query

Pipeline Flow

Prompt Construction (Instruction + 8 Examples)
Codex Inference (Single Pass)
Output Parsing (Extract SQL)
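The three stages above can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' code: the instruction text, the "SQL:" marker used to locate the query, and the `fake_model` stub (which stands in for the real Codex call) are all hypothetical.

```python
from typing import Callable, List, Optional

# Hypothetical instruction text; the paper's actual wording is not given here.
INSTRUCTION = "Decompose the question, list the relevant columns, then write the SQL."


def extract_sql(completion: str) -> Optional[str]:
    """Return the first SQL statement after the last 'SQL:' marker, or None."""
    marker = "SQL:"  # assumed output convention
    idx = completion.rfind(marker)
    if idx == -1:
        return None
    sql = completion[idx + len(marker):].strip()
    sql = sql.split(";", 1)[0]          # keep only the first statement
    return " ".join(sql.split()) or None  # collapse whitespace


def run_pipeline(question: str, exemplars: List[str],
                 model: Callable[[str], str]) -> Optional[str]:
    # 1. Prompt construction
    prompt = "\n\n".join([INSTRUCTION, *exemplars, f"Question: {question}"])
    # 2. Inference: exactly one model call
    completion = model(prompt)
    # 3. Output parsing
    return extract_sql(completion)


def fake_model(prompt: str) -> str:
    """Stub standing in for the LLM; returns a canned reasoning trace."""
    return ("1. Which flights go to Aberdeen?\n"
            "2. How many of them are United Airlines flights?\n"
            "SQL: SELECT count(*)\n  FROM flights;\n")


print(run_pipeline("How many flights go to Aberdeen?", [], fake_model))
```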

System Modules

Prompt Constructor

Assembles the input prompt with API Docs format schema and few-shot examples containing QDecomp+InterCOL reasoning paths

Model or implementation: N/A (Deterministic)
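A minimal sketch of such a deterministic prompt constructor is shown below. It assumes the "API Docs" schema format resembles a comment-style listing with one `# table(col1, col2)` line per table; the header string, the `format_schema` and `assemble_prompt` helpers, and the example schema are assumptions, not details confirmed by this summary.

```python
from typing import Dict, List


def format_schema(schema: Dict[str, List[str]]) -> str:
    """Serialize a schema as comment lines, one table per line (assumed format)."""
    lines = ["### SQLite SQL tables, with their properties:", "#"]
    for table, columns in schema.items():
        lines.append(f"# {table}({', '.join(columns)})")
    lines.append("#")
    return "\n".join(lines)


def assemble_prompt(instruction: str, exemplars: List[str],
                    schema: Dict[str, List[str]], question: str) -> str:
    """Join instruction, few-shot exemplars, target schema, and target question."""
    parts = [instruction, *exemplars, format_schema(schema), f"Question: {question}"]
    return "\n\n".join(parts)


# Hypothetical schema used only for illustration.
schema = {
    "airports": ["city", "airport_code"],
    "flights": ["airline", "dest_airport"],
}
prompt = assemble_prompt(
    instruction="Translate the question into SQL.",
    exemplars=["<exemplar 1>", "<exemplar 2>"],
    schema=schema,
    question="How many flights go to Aberdeen?",
)
print(prompt)
```

Because the function has no randomness or model calls, the same inputs always give the same prompt, which matches the "Deterministic" label above.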

LLM Inference

Generates the decomposition steps, intermediate column identifications, and final SQL in one continuous sequence

Model or implementation: Codex (code-davinci-002)

Novel Architectural Elements

Single-pass question decomposition prompting structure: generates sub-questions and final SQL in one output, avoiding the latency of iterative Least-to-Most prompting
InterCOL annotation format: embeds explicit table/column grounding steps within the natural language decomposition trace

Modeling

Base Model: Codex (code-davinci-002)

Compute: Not reported in the paper (Inference only via OpenAI API)

Comparison to Prior Work

vs. Chain-of-Thought: QDecomp avoids describing SQL clauses directly, reducing hallucination of SQL syntax in reasoning steps
vs. Least-to-Most: QDecomp uses single-pass generation instead of multiple API calls, reducing cost and latency while maintaining global context
vs. RASAT+PICARD: QDecomp achieves comparable performance (when prompted with extra-hard in-context examples) without fine-tuning or constrained decoding
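The cost difference between iterative and single-pass prompting can be illustrated by counting model calls. The sketch below assumes a Least-to-Most loop that makes one call to decompose the question and one further call per sub-question; that call pattern, the prompt strings, and the `CountingModel` stub are illustrative assumptions rather than the paper's implementation.

```python
class CountingModel:
    """Stub model that only counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"<output {self.calls}>"


def least_to_most(model, question: str, num_sub_questions: int) -> str:
    # Stage 1: one call to decompose the question (assumed).
    context = model(f"Decompose: {question}")
    # Stage 2: one call per sub-question, each seeing the earlier outputs.
    for i in range(num_sub_questions):
        context += "\n" + model(f"{context}\nSolve sub-question {i + 1}")
    return context


def single_pass(model, question: str) -> str:
    # Decomposition and final SQL are produced in one generation.
    return model(f"Decompose and answer: {question}")


iterative_model, one_shot_model = CountingModel(), CountingModel()
least_to_most(iterative_model, "How many flights go to Aberdeen?", num_sub_questions=2)
single_pass(one_shot_model, "How many flights go to Aberdeen?")
print(iterative_model.calls, one_shot_model.calls)
```

Under these assumptions the iterative variant needs 1 + n calls for n sub-questions, while the single-pass variant always needs one.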

Limitations

Dependency on proprietary, now-deprecated model (Codex code-davinci-002)
The InterCOL method requires manual annotation of column mappings for the few-shot examples
Performance on 'extra hard' queries still lags significantly behind 'easy' queries (38.1% vs 89.6%)
Does not use database content (cell values) for linking, unlike some baselines (LEVER)

📊 Experiments & Results

Evaluation Setup

Few-shot (8-shot) in-context learning on cross-domain Text-to-SQL datasets

Benchmarks:

Spider (cross-domain Text-to-SQL, development set)
Spider Realistic (Text-to-SQL with removed explicit column mentions)
GeoQuery (Single-domain Text-to-SQL)

Metrics:

Test-suite execution accuracy (TS)
Standard execution accuracy (EX)
Component matching accuracy

Statistical methodology: Mean and standard deviation are reported over 5 different sets of randomly selected in-context examples
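As a small illustration of this reporting scheme, the sketch below aggregates one accuracy score per exemplar set. The five scores are made-up placeholders, not results from the paper, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
import statistics
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-run accuracies."""
    return statistics.mean(scores), statistics.stdev(scores)


# PLACEHOLDER values for 5 exemplar sets; NOT numbers from the paper.
placeholder_scores = [70.0, 71.0, 72.0, 73.0, 74.0]
mean, std = summarize(placeholder_scores)
print(f"{mean:.1f} +/- {std:.1f}")
```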

Key Results

Main results on Spider datasets show QDecomp+InterCOL outperforms both Standard prompting and other CoT styles (Chain-of-Thought, Least-to-Most).

Benchmark	Metric	Baseline	This Paper	Δ
Spider Dev	Test-suite execution accuracy (TS)	63.2 (Standard)	68.4	+5.2
Spider Dev	Test-suite execution accuracy (TS)	66.0 (Least-to-Most)	68.4	+2.4
Spider Dev	Test-suite execution accuracy (TS)	56.8 (Chain-of-Thought)	68.4	+11.6
Spider Realistic	Test-suite execution accuracy (TS)	51.0 (Standard)	56.5	+5.5

Robustness check using 'Extra Hard' (G3) examples for in-context learning.

Benchmark	Metric	Baseline	This Paper	Δ
Spider Dev	Test-suite execution accuracy (TS)	58.2	68.8	+10.6
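The Δ column is the absolute difference in percentage points between the two accuracy columns. The short check below recomputes it from the reported values; the row labels and the `absolute_gain` helper are only for illustration.

```python
# (label, baseline accuracy, this paper's accuracy), copied from the rows above.
rows = [
    ("Spider Dev", 63.2, 68.4),
    ("Spider Dev", 66.0, 68.4),
    ("Spider Dev", 56.8, 68.4),
    ("Spider Realistic", 51.0, 56.5),
    ("Spider Dev, extra-hard exemplars", 58.2, 68.8),
]


def absolute_gain(baseline: float, ours: float) -> float:
    """Difference in percentage points, rounded to one decimal like the table."""
    return round(ours - baseline, 1)


deltas = [absolute_gain(b, o) for _, b, o in rows]
for (name, b, o), d in zip(rows, deltas):
    print(f"{name}: {b} -> {o} ({d:+.1f} points)")
```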

Main Takeaways

Iterative prompting (Least-to-Most) is not necessary for Text-to-SQL; single-pass decomposition yields better accuracy and lower latency
Traditional Chain-of-Thought (describing SQL clauses) hurts performance compared to standard prompting because detailed steps introduce errors that propagate to the code
Explicitly prompting for table and column names (InterCOL) during the reasoning phase significantly improves schema linking
QDecomp is robust to example selection, performing well even when prompted with only 'Extra Hard' examples, whereas baselines degrade

📚 Prerequisite Knowledge

Prerequisites

Text-to-SQL task definition
In-Context Learning / Few-shot Prompting
Basics of SQL structure (SELECT, FROM, WHERE)
Chain-of-Thought (CoT) concepts

Key Terms

CoT: Chain-of-Thought, a prompting technique where the model generates intermediate reasoning steps before the final answer

Least-to-Most: A prompting strategy that breaks a problem into sub-problems and solves them iteratively (usually requiring multiple model calls)

Schema Linking: The process of identifying which tables and columns in a database schema correspond to the entities mentioned in a natural language question

Test-suite accuracy: A robust evaluation metric for Text-to-SQL that compares the execution results of the predicted SQL against the gold SQL on multiple synthetic database contents to avoid false positives
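The idea behind this metric can be sketched with Python's built-in `sqlite3` module: a prediction counts as correct only if its result matches the gold query's result on every database instance. This is a simplified illustration, not the official evaluation script: the single-table `flights` schema, the helper names, and the order-insensitive comparison are assumptions.

```python
import sqlite3
from collections import Counter
from typing import List, Optional, Tuple

Row = Tuple[str, str]


def run(sql: str, rows: List[Row]) -> Optional[list]:
    """Execute sql on a fresh in-memory database; None if the query fails."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical one-table schema, used only for this illustration.
    conn.execute("CREATE TABLE flights (airline TEXT, dest_city TEXT)")
    conn.executemany("INSERT INTO flights VALUES (?, ?)", rows)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    finally:
        conn.close()


def matches_on_all_databases(pred_sql: str, gold_sql: str,
                             databases: List[List[Row]]) -> bool:
    """True only if the results agree (as multisets) on every database."""
    for rows in databases:
        pred, gold = run(pred_sql, rows), run(gold_sql, rows)
        if pred is None or gold is None or Counter(pred) != Counter(gold):
            return False
    return True


gold = ("SELECT count(*) FROM flights "
        "WHERE airline = 'United Airlines' AND dest_city = 'Aberdeen'")
pred = "SELECT count(*) FROM flights WHERE airline = 'United Airlines'"  # missing filter

db1 = [("United Airlines", "Aberdeen")]
db2 = [("United Airlines", "Aberdeen"), ("United Airlines", "Boston")]

print(matches_on_all_databases(pred, gold, [db1]))       # agrees by coincidence
print(matches_on_all_databases(pred, gold, [db1, db2]))  # second database exposes the error
```

On `db1` alone the wrong query happens to return the same count as the gold query, which is the kind of false positive that checking several database contents is meant to catch.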

InterCOL: Intermediate Column detection, the paper's proposed method of explicitly listing table-column pairs involved in each decomposed sub-question

QDecomp: Question Decomposition Prompting, the paper's proposed method to generate sub-questions and the final SQL in a single pass

Codex: A family of Large Language Models trained on code (specifically code-davinci-002 in this paper)

Spider: A large-scale, cross-domain semantic parsing and Text-to-SQL benchmark dataset