Question Answering over Tabular Data with DataBench: A Large-Scale Empirical Evaluation of LLMs

📝 Paper Summary

Tabular Question Answering, LLM Evaluation, Table Reasoning

DataBench is a diverse benchmark of 65 real-world tabular datasets and 1300 questions designed to reveal the limitations of LLMs in reasoning over structured data types beyond simple Wikipedia tables.

Core Problem

Existing tabular QA benchmarks primarily rely on clean, small Wikipedia tables that lack the complexity, size, and diverse data types (booleans, lists, URLs) found in real-world data analysis tasks.

Why it matters:

Current benchmarks like OpenWikiTables are saturated and do not test reasoning over lists, booleans, or dirty data
Real-world data analytics involves large tables (millions of rows) and diverse types (dates, URLs, lists) not represented in academic datasets
There is a significant gap in understanding how well current LLMs function as reliable tabular reasoners for business intelligence

Concrete Example: A user asks 'Are there any passengers under 30?' about a dataset. A model tuned on Wikipedia tables might fail because it has rarely seen boolean answers, or list-type answers such as '[Lil Lama, Cody Lama]', during training, leading to format errors or hallucinations.

Key Novelty

DataBench: Diverse Real-World Tabular QA Benchmark

Aggregates 65 heterogeneous datasets from domains like Finance, Health, and Sports, moving beyond Wikipedia to real-world sources like Kaggle and government data
Introduces complex answer types rarely tested: lists of numbers, lists of categories, and booleans, alongside standard number/text answers
Evaluates two distinct prompting paradigms: Zero-shot In-Context Learning (feeding data directly) vs. Code-based (generating Python/Pandas code)

Architecture

Overview of the two evaluation prompting strategies: In-Context Learning (Z-ICL) vs. Code-based.

Evaluation Highlights

ChatGPT-3.5 (closed source) achieves 63.0% accuracy on Code-based prompts, significantly outperforming the best open model (CodeLlama-13b) at 33.1%
All models struggle significantly with 'list' answer types; Llama-2-7b achieves only 0.8% accuracy on list[number] questions using In-Context Learning
Code-based prompting consistently outperforms In-Context Learning for numerical reasoning, raising CodeLlama-7b accuracy on number questions from ~14% (Z-ICL) to ~43% (Code-based)

Breakthrough Assessment

7/10

Significant contribution to evaluation methodology by introducing a realistic, diverse benchmark that exposes major gaps in current LLMs (especially open-source) regarding tabular reasoning.

⚙️ Technical Details

Problem Definition

Setting: Question Answering over Tabular Data (CSV/DataFrame format)

Inputs: A tabular dataset D (CSV file) and a natural language question q

Outputs: A factoid answer a (boolean, number, category, or list) or executable code generating the answer
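
The answer kinds listed above can be made concrete with a small classifier. This is a minimal sketch: the function name `answer_type`, the `Answer` alias, and the label strings are illustrative assumptions, not taken from the benchmark's own code.

```python
from typing import List, Union

# Hypothetical alias covering the answer kinds named in the summary.
Answer = Union[bool, int, float, str, List[Union[int, float]], List[str]]


def answer_type(answer: Answer) -> str:
    """Return an illustrative label for the kind of answer."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(answer, bool):
        return "boolean"
    if isinstance(answer, (int, float)):
        return "number"
    if isinstance(answer, str):
        return "category"
    if isinstance(answer, list):
        # A list counts as numeric only if every element is a non-boolean number.
        # Note: an empty list falls into this branch in this sketch.
        if all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in answer):
            return "list[number]"
        return "list[category]"
    raise TypeError(f"unsupported answer type: {type(answer).__name__}")


print(answer_type(True))
print(answer_type([3, 7.5]))
print(answer_type(["Lil Lama", "Cody Lama"]))
```

Checking `bool` first matters because `isinstance(True, int)` is true in Python, so a boolean answer would otherwise be labeled as a number.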

Pipeline Flow

Input: Dataset (CSV) + Question
Prompting Strategy (Branch A: Z-ICL or Branch B: Code-based)
Model Inference (Generate JSON or Python Code)
Execution/Parsing (Run code or parse JSON)
Evaluation (Compare against Gold Standard)

System Modules

Prompt Generator

Constructs the prompt based on the chosen strategy (Z-ICL or Code-based)

Model or implementation: N/A (Deterministic)
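
A deterministic prompt generator of this kind might look like the sketch below. The prompt wording, the function names, and the `"zicl"`/`"code"` strategy labels are all hypothetical; the exact prompts are given in the paper and are not reproduced here. To stay dependency-free, the table is represented as a list of row dictionaries rather than a pandas DataFrame.

```python
import json
from typing import Dict, List

Row = Dict[str, object]


def build_zicl_prompt(rows: List[Row], question: str) -> str:
    """Z-ICL: place the table content itself in the prompt and ask for JSON."""
    table_json = json.dumps(rows, ensure_ascii=False)
    return (
        "You are given a table as a JSON list of rows.\n"
        f"Table: {table_json}\n"
        f"Question: {question}\n"
        'Reply with JSON only, in the form {"answer": ...}.'
    )


def build_code_prompt(columns: List[str], question: str) -> str:
    """Code-based: show only the column names and ask the model to finish a function."""
    return (
        "import pandas as pd\n\n"
        f"# Complete the function so that it answers: {question}\n"
        "def answer(df: pd.DataFrame):\n"
        f"    df.columns = {columns!r}\n"
        "    return"
    )


def build_prompt(strategy: str, rows: List[Row], question: str) -> str:
    """Dispatch on the chosen strategy; no randomness is involved."""
    if strategy == "zicl":
        return build_zicl_prompt(rows, question)
    if strategy == "code":
        columns = list(rows[0].keys()) if rows else []
        return build_code_prompt(columns, question)
    raise ValueError(f"unknown strategy: {strategy}")


rows = [{"name": "Ann", "age": 28}, {"name": "Bob", "age": 41}]
print(build_prompt("zicl", rows, "Are there any passengers under 30?"))
print(build_prompt("code", rows, "Are there any passengers under 30?"))
```

In this sketch the Z-ICL prompt grows with the table, while the code prompt only needs the column names. That difference is one plausible reason the table is truncated for the in-context setting.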

LLM Inference

Generates the response (JSON answer or Python code)

Model or implementation: Evaluated Models: Llama-2 (7B/13B), CodeLlama (7B/13B), ChatGPT-3.5

Response Parser / Executor

Extracts the answer from JSON or executes generated Python code

Model or implementation: Python Interpreter / JSON Parser
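
The two branches of this module can be sketched as follows. The `"answer"` JSON key and the convention that generated code defines a function named `answer` are assumptions made for illustration. The demo table is a plain list of dictionaries; in the paper's setting it would be a pandas DataFrame.

```python
import json
from typing import Any, Callable, Optional


def parse_json_answer(response: str) -> Optional[Any]:
    """Return the "answer" field of the outermost {...} span, or None on failure.

    In this sketch a JSON null answer is indistinguishable from a parse failure.
    """
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return None  # no JSON object found: treated as a format error
    try:
        payload = json.loads(response[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or "answer" not in payload:
        return None
    return payload["answer"]


def run_generated_code(code: str, table: Any) -> Any:
    """Execute model-written code that defines `answer(df)` and call it on the table."""
    namespace: dict = {}
    # WARNING: exec runs arbitrary code. Real use needs a sandbox or subprocess.
    exec(code, namespace)
    func: Callable[[Any], Any] = namespace["answer"]
    return func(table)


table = [{"name": "Ann", "age": 28}, {"name": "Bob", "age": 41}]
generated = "def answer(df):\n    return any(row['age'] < 30 for row in df)\n"

print(parse_json_answer('Sure! {"answer": [1, 2]}'))
print(run_generated_code(generated, table))
```

Returning `None` for unparseable output keeps format failures separate from wrong answers, which is the distinction the Format Error Rate metric relies on.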

Novel Architectural Elements

Dual-track evaluation framework comparing direct reasoning (In-Context) vs. tool-use reasoning (Code Generation) on the same heterogeneous datasets

Modeling

Base Model: Evaluated: Llama-2-7B/13B-Chat, CodeLlama-7B/13B-Instruct, ChatGPT-3.5-Turbo-0613

Training Method: Zero-shot inference only

Adaptation: None (Zero-shot evaluation)

Trainable Parameters: None (Inference only)

Compute: 16 GB M2 2022 MacBook Air (CPU with Metal optimization) for open models; OpenAI API for ChatGPT

Comparison to Prior Work

vs. OpenWikiTables: DataBench includes 65 datasets from diverse domains (not just Wikipedia) and complex answer types (lists, booleans)
vs. TAPEX: DataBench evaluates general-purpose LLMs in zero-shot settings rather than fine-tuning specialized table models
vs. FeTaQA: Focuses on factoid/short answers (boolean, number, list) rather than long-form free-text explanations
Novel contribution: Systematic comparison of 'Code-based' vs 'In-Context' prompting across heterogeneous real-world data types

Limitations

Evaluation primarily uses DataBench_lite (first 20 rows) due to context window constraints
Restricted to English language datasets
Limited number of models tested (only Llama-2/CodeLlama family and GPT-3.5)
Reliance on strict JSON/Code formatting might conflate reasoning errors with instruction-following errors

Reproducibility

Dataset: https://huggingface.co/datasets/cardiffnlp/databench

The benchmark is publicly available on HuggingFace. Evaluation code is not explicitly linked, but the dataset repository is provided. The open models used are standard public checkpoints (Llama-2, CodeLlama), and the exact prompts are provided in the paper.

📊 Experiments & Results

Evaluation Setup

Zero-shot QA on DataBench_lite (first 20 rows of 65 datasets)
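
Deriving a "lite" table by keeping the first 20 rows can be sketched with the standard library alone. The function name `first_rows` and the synthetic CSV are illustrative; with pandas the equivalent operation would be `df.head(20)`.

```python
import csv
import io
from itertools import islice
from typing import Dict, List


def first_rows(csv_text: str, n: int = 20) -> List[Dict[str, str]]:
    """Return the first n data rows of a CSV as dictionaries (header excluded)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(islice(reader, n))


# Synthetic 50-row table, used only to demonstrate the truncation.
csv_text = "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(50))
lite = first_rows(csv_text)
print(len(lite))  # 20
```

Because `islice` stops reading after `n` rows, the full file is never loaded into memory, which matters for the large tables the benchmark targets.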

Benchmarks:

DataBench (Tabular Question Answering) [New]

Metrics:

Accuracy (exact match on answer content, tolerant of formatting differences)
Format Error Rate (percentage of unparseable outputs)
Statistical methodology: Not explicitly reported in the paper
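
The two metrics above can be computed along the following lines. The normalization rules here (case-insensitive strings, numbers rounded to two decimals, order-insensitive lists) are assumptions chosen for illustration; the summary does not specify the paper's exact relaxation rules. A prediction of `None` stands for an output that could not be parsed.

```python
from typing import Any, Dict, List, Optional


def normalize(value: Any) -> Any:
    """Map an answer to a canonical form so trivial formatting differences vanish."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return round(float(value), 2)
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "false"):
            return text == "true"
        try:
            return round(float(text), 2)
        except ValueError:
            return text
    if isinstance(value, list):
        # Order-insensitive; sorting by repr avoids errors on mixed element types.
        return sorted((normalize(v) for v in value), key=repr)
    return value


def matches(pred: Any, gold: Any) -> bool:
    """Compare canonical forms via repr so that True and 1.0 are not conflated."""
    return repr(normalize(pred)) == repr(normalize(gold))


def score(predictions: List[Optional[Any]], golds: List[Any]) -> Dict[str, float]:
    """Return accuracy and format error rate, both as percentages."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("predictions and golds must be non-empty and equal in length")
    total = len(golds)
    format_errors = sum(p is None for p in predictions)
    correct = sum(p is not None and matches(p, g) for p, g in zip(predictions, golds))
    return {
        "accuracy": 100 * correct / total,
        "format_error_rate": 100 * format_errors / total,
    }


predictions = [" 42 ", ["b", "a"], None, "false"]
golds = [42, ["a", "b"], True, True]
print(score(predictions, golds))  # {'accuracy': 50.0, 'format_error_rate': 25.0}
```

In this sketch a format error also counts against accuracy, so the two numbers are not independent. Reporting both is what makes it possible to tell reasoning failures apart from instruction-following failures.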

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Overall accuracy comparison showing the dominance of closed-source models and code-based prompting.
DataBench_lite	Average Accuracy	33.1	63.0	+29.9
DataBench_lite	Average Accuracy	33.4	63.0	+29.6
Performance breakdown by answer type highlights specific weaknesses in list processing.
DataBench_lite	Accuracy (List[Number])	1.6	56.5	+54.9
DataBench_lite	Accuracy (Boolean)	50.0	52.7	+2.7
Complexity analysis showing performance drop when reasoning over multiple columns.
DataBench_lite	Accuracy (Multiple Cols)	67.0	57.4	-9.6

Experiment Figures

Qualitative examples of model failures for both Z-ICL and Code prompts.

Breakdown of code generation error types (KeyError, TypeError, SyntaxError) by model.
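
A breakdown like this can be produced by executing each generated snippet and tallying the exception class it raises. The sketch below is an assumption about how such a tally could be built, not the paper's tooling. The snippets and the `answer(df)` convention are hypothetical, and the table is a plain list of dictionaries rather than a DataFrame.

```python
from collections import Counter
from typing import Any, Iterable


def tally_errors(snippets: Iterable[str], table: Any) -> Counter:
    """Run each snippet and count outcomes by exception class name."""
    counts: Counter = Counter()
    for code in snippets:
        try:
            namespace: dict = {}
            # WARNING: exec runs arbitrary code; sandbox it in real use.
            exec(code, namespace)  # a SyntaxError surfaces here, at compile time
            func = namespace.get("answer")
            if func is None:
                counts["missing_function"] += 1
                continue
            func(table)
            counts["ok"] += 1
        except Exception as exc:
            counts[type(exc).__name__] += 1
    return counts


table = [{"name": "Ann", "age": 28}]
snippets = [
    "def answer(df):\n    return len(df)\n",               # runs cleanly
    "def answer(df):\n    return df[0]['agee']\n",         # misspelled column
    "def answer(df):\n    return df[0]['name'] + 1\n",     # str + int
    "def answer(df) return 1\n",                            # malformed def
]
print(tally_errors(snippets, table))
```

A misspelled or invented column name shows up as a `KeyError`, which ties this breakdown to the column-hallucination behavior noted in the takeaways.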

Main Takeaways

Code-based prompting is superior to In-Context Learning for tabular QA, especially for numerical and list-based questions, likely because Python libraries handle the computation better than the LLM's internal reasoning.
Open-source models (Llama-2, CodeLlama) lag significantly behind closed-source models (ChatGPT) in both reasoning accuracy and instruction following (format errors).
Boolean questions remain surprisingly difficult for code-generation approaches, potentially because generating logic verification code is more complex than simple aggregation code.
Models frequently 'hallucinate' the wrong columns to use but sometimes arrive at the correct answer anyway, or correctly identify columns but fail the reasoning step.
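
One part of the column-hallucination behavior is easy to check mechanically: whether the columns a model claims to have used exist in the table at all. The sketch below assumes the model reports its columns as a list; the names `unknown_columns` and `columns_used` are hypothetical. It does not detect the other case described above, where existing but irrelevant columns are chosen.

```python
from typing import List


def unknown_columns(columns_used: List[str], table_columns: List[str]) -> List[str]:
    """Return the reported columns that are not present in the table (exact match)."""
    known = set(table_columns)
    return [c for c in columns_used if c not in known]


table_columns = ["name", "age", "fare"]
print(unknown_columns(["age", "ticket_class"], table_columns))  # ['ticket_class']
```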

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and prompting strategies
Familiarity with tabular data formats (CSV, Pandas DataFrames)
Knowledge of Python/Pandas for code generation tasks

Key Terms

DataBench_lite: A reduced version of the benchmark containing only the first 20 rows of each dataset to fit within limited LLM context windows

Z-ICL: Zero-shot In-Context Learning, i.e., prompting the model with the dataset content directly in the context window without parameter updates

Code-based prompting: Prompting the model to generate executable Python/Pandas code to derive the answer rather than predicting the answer text directly

format error: A model output that fails to parse into the requested JSON or executable code format; the Format Error Rate is the percentage of such outputs

hallucination: When a model generates an answer that is factually incorrect or references non-existent data columns

pandas: A popular Python library for data manipulation and analysis, used here as the target language for code generation