DATER significantly outperforms baselines on TabFact, surpassing even human performance when combined with fine-tuned models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TabFact | Accuracy | 92.1 | 93.0 | +0.9 |
| TabFact | Accuracy | 72.6 | 85.6 | +13.0 |
| TabFact | Accuracy | 85.1 | 85.6 | +0.5 |

DATER achieves state-of-the-art results on WikiTableQuestions, showing strong generalization to complex questions.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| WikiTableQuestions | Accuracy | 61.9 | 65.9 | +4.0 |
| WikiTableQuestions | Accuracy | 47.6 | 65.9 | +18.3 |

Ablation studies confirm that both evidence and question decomposition are critical.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| WikiTableQuestions | Accuracy | 61.4 | 65.9 | +4.5 |
| TabFact | Accuracy | 81.8 | 85.6 | +3.8 |
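
As a quick sanity check on the Δ columns, a minimal Python sketch (an illustration, not part of the paper's code) recomputing each improvement as `this_paper - baseline`:

```python
# Each row is (benchmark, baseline, this_paper), taken from the tables above.
rows = [
    ("TabFact", 92.1, 93.0),
    ("TabFact", 72.6, 85.6),
    ("TabFact", 85.1, 85.6),
    ("WikiTableQuestions", 61.9, 65.9),
    ("WikiTableQuestions", 47.6, 65.9),
    ("WikiTableQuestions", 61.4, 65.9),
    ("TabFact", 81.8, 85.6),
]

# Recompute Δ, rounding to one decimal to match the reported precision.
deltas = [round(new - old, 1) for _, old, new in rows]
print(deltas)  # [0.9, 13.0, 0.5, 4.0, 18.3, 4.5, 3.8]
```

All seven recomputed deltas match the reported Δ values.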